feat!: support void primitive type in schema #1858
cdelmonte-zg wants to merge 15 commits into delta-io:main
Conversation
The semver check fails because adding a variant to PrimitiveType and a field to EngineSchemaVisitor are technically breaking changes. I'm happy to adjust the approach if maintainers prefer a different strategy.
Codecov Report ❌

@@            Coverage Diff             @@
##             main    #1858      +/-   ##
==========================================
+ Coverage   88.31%   88.33%    +0.01%
==========================================
  Files         173      174        +1
  Lines       57210    57482      +272
==========================================
+ Hits        50526    50777      +251
- Misses       4743     4754       +11
- Partials     1941     1951       +10
@cdelmonte-zg thank you for opening the PR. My concern here is that we don't understand the end-to-end implications of simply being able to include Nulls in a schema and what would break. At a minimum, I would feel better if we mirrored how this is handled in the spark-connector, by dropping the columns for reads.
Thank you for the detailed feedback @emkornfield. You're right, the original issue #739 explicitly recommends dropping void columns on reads, as stated in the protocol. My PR currently only solves the parsing side (so the kernel no longer crashes on void schemas), but doesn't implement the drop behavior. I'll update this PR to mirror the Spark connector approach:
Will follow up with the changes soon.
Mirror Spark connector behavior for void type handling:
- Drop void columns at scan time so consumers never see them
- Reject void nested inside Array/Map with schema validation error
- Add void_utils module with early-exit visitor for validation
- Wire validation into TableConfiguration on schema load
- Add integration tests with a Delta table containing a void column

Addresses review feedback in delta-io#1858. Closes delta-io#739
force-pushed from d0a91eb to 8ed825e
Apologies, I'm having some offline conversations with other maintainers to better understand the approach we should follow here.
{
  "name": "void_col",
  "type": "void",
  "nullable": true,
Should we also add a test for nullable=false, which is not allowed?
Good point. VOID is inherently always null, so nullable=false is semantically contradictory in this case. I'll add a test to cover it.
Note: depending on the outcome of the discussion with @ZiyaZa about the read-time strategy (drop vs. return nulls), the exact handling here may change. I'll adjust accordingly in the next iteration.
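For illustration, a minimal sketch of what such a test could look like, assuming the kernel deserializes StructField from Delta schema JSON via serde (as noted later in this thread) and that nullable=false is currently tolerated on the read path; the field names mirror the diff above, everything else is hypothetical:

```rust
#[test]
fn test_void_field_not_nullable() {
    // A void column is always null, so nullable=false is semantically
    // contradictory; this sketch assumes we still tolerate it when parsing.
    let json = r#"{"name": "void_col", "type": "void", "nullable": false, "metadata": {}}"#;
    let field: StructField = serde_json::from_str(json).expect("void schema should parse");
    assert_eq!(field.data_type(), &DataType::VOID);
    assert!(!field.nullable);
}
```

The exact assertions would flip if the decision lands on rejecting nullable=false instead of tolerating it.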
    Self::decimal_from_bytes(b.min_bytes_opt(), *d)?
}
(Decimal(..), _) => return None,
(Void, _) => return None,
Practically, this is unreachable code, since void columns are never materialized in Parquet files.
You're right, VOID columns are never written to Parquet files, so this branch is effectively unreachable in practice. The match arm exists for exhaustiveness of the type handling.
I'll add a clarifying comment, and I can remove the artificial test I introduced for this branch since it doesn't correspond to a real execution path.
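For reference, the promised clarifying comment could look roughly like this in the stats match (a sketch; the surrounding arms are abbreviated from the diff above):

```rust
// ... other (PrimitiveType, stats) arms elided ...
(Decimal(..), _) => return None,
// Delta never writes void columns to Parquet files, so no stats can exist
// for them; this arm is unreachable in practice and is kept only so the
// match over PrimitiveType remains exhaustive.
(Void, _) => return None,
```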
Not sure if there is a potential edge case for data that was written with void and then imported into Delta afterwards. In fact, if we wanted to be thorough, we should probably test that a column declared with a non-void type can be read from a file where that column is present with a different physical type (this might expose bugs in arrow's implementation).
But this is probably an orthogonal issue.
Seems better to not even try to physically read a void column. At best it would come back NULL anyway, and at worst it trips over bugs and/or bad files?
Agreed, orthogonal to this PR.
fn transform_map_key(&mut self, etype: &'a DataType) -> Option<Cow<'a, DataType>> {
    if *etype == DataType::VOID {
        self.0 = Some("Void type is not allowed as a map key type");
        return Some(Cow::Borrowed(etype));
    }
    self.transform(etype)
}
Shouldn't we catch this much earlier? E.g. when creating the MapType with a Void key?
Fair point. Today MapType::new accepts any DataType without validation, so catching this there would require introducing validation into the constructor.
That said, the schema is currently deserialized from Delta metadata via serde, which bypasses MapType::new entirely. So adding validation in the constructor would neither block reads nor catch invalid cases during schema parsing.
If we want to enforce this constraint consistently, it would require a dedicated validation step rather than relying on the constructor.
Based on @ZiyaZa's proposal, the direction may shift toward rejecting void-in-map only at write time (not at schema load), so that reads of existing tables with such schemas can still succeed.
I'll revisit this once the overall validation strategy is settled.
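For illustration, such a dedicated validation step could be a plain recursive walk over the deserialized schema, run independently of the constructors. A minimal sketch; the function name and traversal shape are hypothetical, not the PR's actual void_utils API, though the accessor names follow the kernel's Array/Map/Struct types:

```rust
// Hypothetical post-deserialization check: rejects void map keys no matter
// how the schema was constructed (serde or MapType::new).
fn validate_no_void_map_keys(dt: &DataType) -> Result<(), String> {
    match dt {
        DataType::Map(m) => {
            if *m.key_type() == DataType::VOID {
                return Err("Void type is not allowed as a map key type".into());
            }
            validate_no_void_map_keys(m.key_type())?;
            validate_no_void_map_keys(m.value_type())
        }
        DataType::Array(a) => validate_no_void_map_keys(a.element_type()),
        DataType::Struct(s) => s
            .fields()
            .try_for_each(|f| validate_no_void_map_keys(f.data_type())),
        _ => Ok(()),
    }
}
```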
Hey, thanks for this PR. I've been working on improving VOID support in the Delta Spark connector, as currently dropping it on reads is not ideal. Most of the time it results in failures, but even when it works, we incorrectly do not return these columns in the result. Hence it is better if we don't drop it on reads, which is what I'm working on. In an ideal world, we'd write the VOID columns into Parquet files as UNKNOWN; however, it's tricky to make this change because that's not how Delta behaves today. There are existing users of VOID in Delta, and we can't start writing VOID columns into Parquet files and expect them to be present there when reading the files back. So the best middle-ground option is to continue not writing VOID into Parquet files, but still return them in read queries, which I've been working on. I've already made some changes in the Spark Parquet reader that are required for this to work correctly in the Delta Spark connector (see apache/spark#52922). It's included in the Spark 4.1 release, and once Delta drops support for older Spark versions, I'll also make the necessary changes available in the Delta Spark connector. Therefore, it'd be really good if we can align the behavior in this PR with the behavior we'll have in the Spark connector after a short while. The upcoming behavior of the Delta Spark connector is:
Because we are not writing VOID into Parquet, this results in the following limitations that we should catch and throw a proper error:
Note that for these cases, the write operation should fail with a proper error, but we can still perform metadata-only operations or read from such tables. For example, it's possible to create such an empty table, and then evolve the type with schema evolution or add other columns to fix the schema. How does this behavior sound to you? Do you think you can change your PR to behave this way for alignment with how the Delta Spark connector will behave? From what I understand, it seems originally your PR had a similar behavior, so hopefully it is not too much work.
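To make the read-side behavior concrete: since nothing physical exists for a void column, a reader has to synthesize an all-null column of the right length when assembling the batch. A minimal arrow-rs sketch under that assumption; the helper name and the append-at-end position are illustrative, not the connector's actual code:

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, NullArray};
use arrow::datatypes::{DataType as ArrowDataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// Hypothetical helper: the Parquet file contains no column for a void field,
// so append a synthesized all-null column matching the batch's row count.
fn append_void_column(batch: &RecordBatch, name: &str) -> RecordBatch {
    let mut fields: Vec<Field> = batch
        .schema()
        .fields()
        .iter()
        .map(|f| f.as_ref().clone())
        .collect();
    let mut columns: Vec<ArrayRef> = batch.columns().to_vec();
    fields.push(Field::new(name, ArrowDataType::Null, true));
    columns.push(Arc::new(NullArray::new(batch.num_rows())));
    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
        .expect("schema and columns are consistent by construction")
}
```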
Thank you @ZiyaZa for the detailed context on the upcoming Spark connector changes, this is extremely helpful for getting the semantics right. The approach you describe makes a lot of sense:
My PR currently mirrors the existing drop-on-read behavior. I'm happy to rework the PR accordingly. Concretely, this would mean:
Before proceeding with the rework, I'll wait for @emkornfield to confirm the direction so we're all aligned.
@cdelmonte-zg yes, this approach also seems like the right thing to do. Appreciate your flexibility here.
force-pushed from a00d146 to 25e2a9b
Updated based on all review feedback. Key changes since last review:
Ready for re-review.
    Ok(())
}

// Fokko review: void is inherently always-null, so nullable=false is semantically
Let's not leave people's names in the comments.
Fixed: removed all person names from code comments.
// Arrow conversion still works — produces Null type
let arrow_field = ArrowField::try_from_kernel(&field)?;
assert_eq!(arrow_field.data_type(), &ArrowDataType::Null);
assert!(!arrow_field.is_nullable());
I assume we have a test at a higher level in delta that will error on this?
This is tied to the nullable=false question below (line 2211). For now we tolerate it on reads. If we decide to reject it, we'll add validation and update this test accordingly.
Let's add a comment there then?
Just watch out -- arrow's NullArray::is_null method always returns false. Have to request Array::logical_nulls instead if you need accurate nulls. Several other data types fall in the same category. I believe this is the first time we would introduce such a type into kernel, so it could break all kinds of null checks.
Done: added comment explaining why we tolerate nullable=false for void on reads.
Good to know, thanks. In this PR we don't rely on is_null() for NullArray, we check the Arrow type directly (ArrowDataType::Null). But worth keeping in mind if future code needs null checks on this type.
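A quick standalone demonstration of the quirk described above, with a recent arrow-rs (not code from this PR):

```rust
use arrow::array::{Array, NullArray};

fn main() {
    let arr = NullArray::new(3);
    // NullArray carries no physical null buffer, so positional null checks
    // and null_count() report zero nulls even though every value is
    // logically null.
    assert!(!arr.is_null(0));
    assert_eq!(arr.null_count(), 0);
    // logical_nulls() gives the accurate view: all slots are null.
    let nulls = arr.logical_nulls().expect("NullArray is logically all-null");
    assert_eq!(nulls.null_count(), 3);
}
```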
fn test_void_field_in_struct() -> DeltaResult<()> {
    // A struct schema with a void column should convert to Arrow with a Null field
    let schema = StructType::try_new([
        StructField::nullable("id", DataType::INTEGER),
Any reason the integer column is included here?
The integer column is intentional. It verifies that void works correctly alongside normal types in a struct, which is the realistic use case. The isolated void roundtrip is covered by test_void_type_roundtrip.
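For context, the test sketched in the diff above might continue roughly like this, assuming an ArrowSchema::try_from_kernel conversion analogous to the ArrowField::try_from_kernel call shown earlier; the second field name is illustrative:

```rust
fn test_void_field_in_struct() -> DeltaResult<()> {
    // A struct schema with a void column should convert to Arrow with a Null field
    let schema = StructType::try_new([
        StructField::nullable("id", DataType::INTEGER),
        StructField::nullable("void_col", DataType::VOID),
    ])?;
    let arrow_schema = ArrowSchema::try_from_kernel(&schema)?;
    // The ordinary column converts as usual; the void column maps to Null.
    assert_eq!(arrow_schema.field(0).data_type(), &ArrowDataType::Int32);
    assert_eq!(arrow_schema.field(1).data_type(), &ArrowDataType::Null);
    Ok(())
}
```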
I believe it can cause crashes for readers that can't deal with UNKNOWN correctly. The Parquet spec has had UNKNOWN since 2021 (and NULL even before that), but it's not widely supported as far as I know. Parquet-java only added it in 2025. Spark only started supporting UNKNOWN with my PR there recently. If we write it out as UNKNOWN, then for reads:
So the only harm that I can think of is if there is some other reader that doesn't understand UNKNOWN and will start crashing with it being present in the file.
I couldn't find anything about a NULL type in the parquet spec? And Google AI overview says parquet doesn't have such a type?
Wouldn't such a reader anyway not understand the Delta VOID type in the first place, and fail during analysis? Long before execution where parquet becomes relevant?
I think UNKNOWN replaced NULL, looking at the git history, see apache/parquet-format@4bddbad
I don't think so, because VOID and UNKNOWN are kind of different. In the Delta Protocol, it doesn't say anything regarding UNKNOWN type, and it is very vague when mentioning
So, from this, readers must understand the VOID type, because it is in the protocol. But they may or may not understand UNKNOWN. If we write files with UNKNOWN, we require readers to know how to read it from Parquet files. You could also argue that "Behavior is undefined" means queries involving VOID can also crash, so breaking readers is not a problem. When I make the changes in the Spark connector, I'll also make some changes to the Protocol to specify how VOID needs to be handled. But in the current state, it's all very vague and can be understood both ways.
@emkornfield Hi, a gentle ping on this PR. I've addressed the requested changes and simplified the tests as suggested. Would appreciate a re-review when you have time. Thanks!
Reviews aside, it looks like there are significant merge conflicts that could be resolved?
force-pushed from e3ea3f4 to fd1c4c3
Done, rebased onto main, ready for review.
force-pushed from fd1c4c3 to 44dd1fc
PR title does not match the required pattern. Please ensure you follow the conventional commits spec. Your title should start with Title: |
Add support for the `void` data type as defined in the Delta protocol. Existing tables may have void columns; the kernel now parses them and maps them to Arrow's Null type. Closes delta-io#739
- Replace top-level-only void filter with recursive SchemaTransform
that drops void fields at any struct nesting level (Spark parity)
- Add drop_void_fields() using SchemaTransform in void_utils
- Add unit tests for nested and deeply nested void drop
- Add integration test for void inside nested struct
- Add test for explicit projection with void column
- Cover parquet row group skipping Void branches (min/max stats)
- Cover arrow expression Void => NullBuilder branch
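A sketch of the recursive drop described in the commit above, using the kernel's SchemaTransform visitor, where returning None from transform_struct_field removes the field. The recurse_into_struct_field helper is assumed from the transform API shown earlier in this thread; treat the exact method set as illustrative:

```rust
struct StripVoidFields;

impl<'a> SchemaTransform<'a> for StripVoidFields {
    fn transform_struct_field(&mut self, field: &'a StructField) -> Option<Cow<'a, StructField>> {
        // Drop the field entirely if it is void...
        if *field.data_type() == DataType::VOID {
            return None;
        }
        // ...otherwise recurse, so void fields inside nested structs,
        // at any depth, are dropped as well.
        self.recurse_into_struct_field(field)
    }
}
```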
…time Rework void type handling per ZiyaZa/emkornfield review:
- Reads return void columns as null (no longer dropped from schema)
- Write-time validation rejects void in Array/Map, all-void structs/tables
- Metadata-only operations always succeed regardless of void placement
- Physical schema recursively strips void fields (including nested structs)
- Fix VOID↔Null mapping in ensure_data_types for expression evaluation
- Fix NullArray crash in compute_nested_null_masks (null buffer propagation)

Addresses Fokko's inline reviews:
- Add test for nullable=false + void (tolerated on reads)
- Document unreachable void stats branches, remove artificial tests
…n void assertions
Add a targeted unit test for compute_nested_null_masks with a NullArray child and non-empty parent nulls, verifying the early-return fix doesn't regress. Strengthen void-in-array and void-in-map read tests to assert null_count() == len() (the entire column is NULL, not just nullable). Refine comments to clarify semantics: columns absent from the Parquet schema are synthesized as null columns at read time.

…unt quirk
In the compute_nested_null_masks unit test, assert that NullArray reports null_count() == 0 (Arrow has no null bitmap for NullArray) and document the quirk. In void-in-array and void-in-map read tests, drop incorrect DataType::Null assertions — synthesized columns preserve the container type (List<Null>, Map<String,Null>) matching the Delta schema, not a bare NullArray — and clarify this in comments.

…n-map columns
Verify that void-in-array materializes as List<Null> and void-in-map as Map<Utf8,Null>, not a bare NullArray or a different container. Guards against regressions where the synthesized column type silently changes.
generate_logical_to_physical() only dropped top-level void fields, while the physical schema stripped them recursively via strip_void_from_field. This mismatch would cause the expression evaluator to pass void sub-fields that the physical schema did not expect. Add build_void_stripping_transform() to construct nested Transform expressions that drop void at all struct nesting levels. Also add: unit tests for strip_void_from_field (passthrough, mixed void, depth 3, metadata preservation), IS NOT NULL predicate on void column, write_rejects_void_in_map_key integration test, and E2E test verifying the transform drops nested void fields.
Separate the void-detection pre-scan from the Transform construction in build_void_stripping_transform(). The new contains_void() does a pure boolean traversal without allocating intermediate Transforms, reducing cost on deeply nested struct schemas.
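A possible shape for that allocation-free pre-scan, sketched from the commit description above (the recursive walk is the whole trick; names mirror the commit but the body is illustrative):

```rust
// Pure boolean traversal: answers "does this type contain void anywhere?"
// without constructing intermediate Transform expressions.
fn contains_void(dt: &DataType) -> bool {
    match dt {
        DataType::Primitive(PrimitiveType::Void) => true,
        DataType::Struct(s) => s.fields().any(|f| contains_void(f.data_type())),
        DataType::Array(a) => contains_void(a.element_type()),
        DataType::Map(m) => contains_void(m.key_type()) || contains_void(m.value_type()),
        _ => false,
    }
}
```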
- Remove person names from code comments
- Enforce strict empty-string parsing for void scalars
- Use top-level import for validate_schema_for_write
- Fix Parquet comment (void not written by Delta, not a Parquet limitation)
- Name the error_message field in CheckVoidInComplexTypes
- Refactor StripVoidFields and CheckAllVoidStructs to use SchemaTransform (local structs, following MakePhysical pattern)
- Unify "all fields are void" error message for top-level and nested
- Fix read test comment (void columns are read as null, not dropped)
- Parametrize void-in-complex-type validation tests with rstest
Address review feedback by parametrizing void_utils tests with rstest (validation, write rejection, and strip_void cases) and extracting a scan_with_void_schema helper to reduce duplicated setup in read tests.
With void type now supported in schema deserialization, the _snapshot workloads in void_001..void_007 pass. The _read_all workloads still fail due to a separate Arrow/parquet reader limitation (cannot annotate Unknown logical type from BOOLEAN physical type), so narrow the expected failure patterns accordingly.
force-pushed from 44dd1fc to ba1f830
Add support for the `void` data type as defined in the Delta protocol, enabling reads of existing tables that contain void columns (Spark parity with SchemaUtils.dropNullTypeColumns).

- Add a `Void` variant to `PrimitiveType`; handle schema conversion (Arrow `Null`), `parse_scalar`, parquet stats, arrow expressions, and the FFI schema visitor.
- Reject void nested inside Array/Map at schema load (`TableConfiguration::try_new`), matching Spark's AnalysisException behavior.
- Drop void columns at any struct nesting level in `ScanBuilder::build`, so engines never see them in the scan schema.
- Tests: a Delta table containing a void column (`void-column/`), in-memory integration tests (SELECT *, explicit projection, nested struct void), validation tests (array/map rejection), and unit coverage for parquet row group skipping and arrow expression branches.

Semver note: breaking due to the new `PrimitiveType` variant and the new `visit_void` field on `EngineSchemaVisitor` (FFI).

Closes #739