
feat!: support void primitive type in schema#1858

Open
cdelmonte-zg wants to merge 15 commits into delta-io:main from cdelmonte-zg:feat/739-void-type

Conversation


@cdelmonte-zg cdelmonte-zg commented Feb 16, 2026

Add support for the void data type as defined in the Delta protocol, enabling reads of
existing tables that contain void columns (Spark parity with SchemaUtils.dropNullTypeColumns).

  1. Parse void — add Void variant to PrimitiveType; handle schema conversion (Arrow Null),
    parse_scalar, parquet stats, arrow expressions, and FFI schema visitor.
  2. Validate no void in Array/Map — reject void nested inside Array/Map at schema load time
    (TableConfiguration::try_new), matching Spark's AnalysisException behavior.
  3. Drop void fields recursively during scan — remove void fields from structs at any nesting
    level in ScanBuilder::build, so engines never see them in the scan schema.
  4. Tests — fixture-based integration test (void-column/), in-memory integration tests
    (SELECT *, explicit projection, nested struct void), validation tests (array/map rejection),
    and unit coverage for parquet row group skipping and arrow expression branches.

Semver note: breaking due to new PrimitiveType variant and new visit_void field on
EngineSchemaVisitor (FFI).

Closes #739
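The recursive drop described in step 3 can be sketched with simplified stand-ins for the kernel's schema types (the real implementation uses PrimitiveType/StructType and the SchemaTransform machinery; the names and shapes here are illustrative only):

```rust
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Integer,
    String,
    Void,
    Struct(Vec<Field>),
}

#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

/// Recursively drop void fields at any struct nesting level, mirroring
/// Spark's SchemaUtils.dropNullTypeColumns: a void field is removed,
/// and structs are rebuilt from their surviving children.
fn strip_void(field: &Field) -> Option<Field> {
    match &field.data_type {
        DataType::Void => None,
        DataType::Struct(children) => Some(Field {
            name: field.name.clone(),
            data_type: DataType::Struct(children.iter().filter_map(strip_void).collect()),
        }),
        _ => Some(field.clone()),
    }
}

fn main() {
    let root = Field {
        name: "root".to_string(),
        data_type: DataType::Struct(vec![
            Field { name: "id".to_string(), data_type: DataType::Integer },
            Field { name: "v".to_string(), data_type: DataType::Void },
            Field {
                name: "nested".to_string(),
                data_type: DataType::Struct(vec![
                    Field { name: "inner_v".to_string(), data_type: DataType::Void },
                    Field { name: "s".to_string(), data_type: DataType::String },
                ]),
            },
        ]),
    };
    // Both "v" and "nested.inner_v" are gone; "id" and "nested.s" survive.
    println!("{:?}", strip_void(&root).unwrap());
}
```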

@github-actions github-actions Bot added the breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump. label Feb 16, 2026
@cdelmonte-zg
Author

The semver check fails because adding a variant to PrimitiveType and a field to EngineSchemaVisitor are technically breaking changes. I'm happy to adjust the approach if maintainers prefer a different strategy.


codecov Bot commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 89.28571% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.33%. Comparing base (acc6733) to head (44dd1fc).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
kernel/src/transaction/mod.rs 80.00% 6 Missing and 2 partials ⚠️
kernel/src/engine/arrow_conversion/mod.rs 88.23% 0 Missing and 6 partials ⚠️
kernel/src/transaction/stats_verifier.rs 0.00% 6 Missing ⚠️
kernel/src/schema/void_utils.rs 96.52% 3 Missing and 1 partial ⚠️
kernel/src/engine/parquet_row_group_skipping.rs 0.00% 2 Missing ⚠️
ffi/src/expressions/kernel_visitor.rs 0.00% 1 Missing ⚠️
ffi/src/schema.rs 0.00% 1 Missing ⚠️
kernel/src/expressions/scalars.rs 85.71% 0 Missing and 1 partial ⚠️
test-utils/src/table_builder.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1858      +/-   ##
==========================================
+ Coverage   88.31%   88.33%   +0.01%     
==========================================
  Files         173      174       +1     
  Lines       57210    57482     +272     
  Branches    57210    57482     +272     
==========================================
+ Hits        50526    50777     +251     
- Misses       4743     4754      +11     
- Partials     1941     1951      +10     

☔ View full report in Codecov by Sentry.

@emkornfield
Collaborator

@cdelmonte-zg thank you for opening the PR. My concern here is that we don't understand the end-to-end implications of simply being able to include Nulls in the schema, and what would break. I think we should, at a minimum:

  1. Add integration tests for how this would work with reads (including pruning).
  2. I also think there are some undocumented cases where null can't occur according to the Spark connector (nested in Array/Maps).

I would feel better if we mirrored how this is handled in the spark-connector, by dropping the columns on reads.

@cdelmonte-zg
Author

Thank you for the detailed feedback @emkornfield.

You're right, the original issue #739 explicitly recommends dropping void columns on reads, as stated in the protocol. My PR currently only solves the parsing side (so the kernel no longer crashes on void schemas), but doesn't implement the drop behavior.

I'll update this PR to mirror the Spark connector approach:

  1. Drop void columns on reads: filter out void fields in the scan path, so consumers never see them
  2. Block void nested in Array/Map: reject these cases explicitly with an error
  3. Integration tests: read test with a Delta table containing a void column, verifying the column is dropped and other columns read correctly

Will follow up with the changes soon.

cdelmonte-zg added a commit to cdelmonte-zg/delta-kernel-rs that referenced this pull request Feb 22, 2026
Mirror Spark connector behavior for void type handling:
  - Drop void columns at scan time so consumers never see them
  - Reject void nested inside Array/Map with schema validation error
  - Add void_utils module with early-exit visitor for validation
  - Wire validation into TableConfiguration on schema load
  - Add integration tests with a Delta table containing a void column

Addresses review feedback in delta-io#1858.

Closes delta-io#739
@cdelmonte-zg cdelmonte-zg changed the title feat: support void primitive type in schema Draft: feat: support void primitive type in schema Feb 22, 2026
@cdelmonte-zg cdelmonte-zg changed the title Draft: feat: support void primitive type in schema feat: support void primitive type in schema Feb 22, 2026
@emkornfield
Collaborator

Apologies, I'm having some offline conversations with other maintainers to better understand the approach we should follow here.

{
"name": "void_col",
"type": "void",
"nullable": true,
Collaborator


Should we also add a test for nullable=false? Which is not allowed

Author


Good point. VOID is inherently always null, so nullable=false is semantically contradictory in this case. I'll add a test to cover it.

Note: depending on the outcome of the discussion with @ZiyaZa about the read-time strategy (drop vs. return nulls), the exact handling here may change. I'll adjust accordingly in the next iteration.
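A minimal sketch of the check under discussion, with hypothetical names (the kernel's actual validation, if added, would live in the schema layer and may differ depending on the agreed read-time strategy):

```rust
/// Toy field descriptor; the kernel's StructField carries a full DataType.
struct Field {
    name: String,
    is_void: bool,
    nullable: bool,
}

/// A void column can only ever hold null, so declaring it non-nullable
/// is contradictory. This check would reject such schemas if we decide
/// not to tolerate them on reads.
fn validate_void_nullability(field: &Field) -> Result<(), String> {
    if field.is_void && !field.nullable {
        return Err(format!(
            "column '{}' has type void but is declared non-nullable",
            field.name
        ));
    }
    Ok(())
}

fn main() {
    let bad = Field { name: "void_col".to_string(), is_void: true, nullable: false };
    println!("{:?}", validate_void_nullability(&bad));
}
```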

Self::decimal_from_bytes(b.min_bytes_opt(), *d)?
}
(Decimal(..), _) => return None,
(Void, _) => return None,
Collaborator


Practically, this is unreachable code, since void columns are never materialized in Parquet files.

Author


You're right, VOID columns are never written to Parquet files, so this branch is effectively unreachable in practice. The match arm exists for exhaustiveness of the type handling.

I'll add a clarifying comment, and I can remove the artificial test I introduced for this branch since it doesn't correspond to a real execution path.

Collaborator


Not sure if there is a potential edge case for data that was written with void and then imported into Delta afterwards. In fact, if we wanted to be thorough, we should probably test that a column with a non-void type can read a file with a void column present with a different physical type (this might expose bugs in arrow's implementation).

Collaborator


But this is probably an orthogonal issue.

Collaborator


Seems better to not even try to physically read a void column. At best it would come back NULL anyway, and at worst it trips over bugs and/or bad files?

Author


Agreed, orthogonal to this PR.

Comment on lines +61 to +67
fn transform_map_key(&mut self, etype: &'a DataType) -> Option<Cow<'a, DataType>> {
if *etype == DataType::VOID {
self.0 = Some("Void type is not allowed as a map key type");
return Some(Cow::Borrowed(etype));
}
self.transform(etype)
}
Collaborator


Shouldn't we catch this much earlier? E.g. when creating the MapType with a Void key?

Author


Fair point. Today MapType::new accepts any DataType without validation, so catching this there would require introducing validation into the constructor.

That said, the schema is currently deserialized from Delta metadata via serde, which bypasses MapType::new entirely. So adding validation in the constructor would neither block reads nor catch invalid cases during schema parsing.
If we want to enforce this constraint consistently, it would require a dedicated validation step rather than relying on the constructor.

Based on @ZiyaZa's proposal, the direction may shift toward rejecting void-in-map only at write time (not at schema load), so that reads of existing tables with such schemas can still succeed.

I'll revisit this once the overall validation strategy is settled.
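A self-contained sketch of what such a dedicated validation step could look like, with toy types standing in for the kernel's DataType (the real code would use a SchemaTransform visitor, as in the transform_map_key snippet above; error strings and the map-value rule here are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum DataType {
    Integer,
    Void,
    Array(Box<DataType>),
    Map(Box<DataType>, Box<DataType>),
}

/// Walk the type and return the first violation, or None if the type is
/// acceptable. A bare top-level void column is allowed; void nested in
/// an array or map is not.
fn check_void_in_containers(dt: &DataType) -> Option<&'static str> {
    match dt {
        DataType::Integer | DataType::Void => None,
        DataType::Array(elem) => {
            if **elem == DataType::Void {
                Some("Void type is not allowed as an array element type")
            } else {
                check_void_in_containers(elem)
            }
        }
        DataType::Map(key, value) => {
            if **key == DataType::Void {
                return Some("Void type is not allowed as a map key type");
            }
            if **value == DataType::Void {
                return Some("Void type is not allowed as a map value type");
            }
            check_void_in_containers(key).or_else(|| check_void_in_containers(value))
        }
    }
}

fn main() {
    let bad = DataType::Map(Box::new(DataType::Void), Box::new(DataType::Integer));
    println!("{:?}", check_void_in_containers(&bad));
}
```

Running this as a standalone step keeps the constructors (MapType::new and friends) validation-free, which matters because serde deserialization bypasses them entirely.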


ZiyaZa commented Feb 25, 2026

Hey, thanks for this PR. I've been working on improving VOID support in the Delta Spark connector, as currently dropping it on reads is not ideal. Most of the time it results in failures, but even when it works, we incorrectly do not return these columns in the result. Hence it is better if we don't drop it on reads, which is what I'm working on. In an ideal world, we'd write the VOID columns into Parquet files as UNKNOWN; however, it's tricky to make this change because that's not how Delta behaves today. There are existing users of VOID in Delta, and we can't start writing VOID columns into Parquet files and expect them to be present there when reading the files back.

So the best middle-ground option is to continue not writing VOID into Parquet files, but still return those columns in read queries, which is what I've been working on. I've already made some changes in the Spark Parquet reader that are required for this to work correctly in the Delta Spark connector (see apache/spark#52922). It's included in the Spark 4.1 release, and once Delta drops support for older Spark versions, I'll also make the necessary changes available in the Delta Spark connector. Therefore, it'd be really good if we can align the behavior in this PR with the behavior we'll have in the Spark connector after a short while. The upcoming behavior of the Delta Spark connector is:

  • VOID never gets to Parquet files upon writes (The Parquet schema won't contain any VOID columns)
  • VOID will be present in Delta metadata schema
  • Reads will generate null values on the fly, as the VOID columns will be missing from Parquet files, but still defined in Delta schema

Because we are not writing VOID into Parquet, this results in the following limitations that we should catch and throw a proper error:

  • We can't write data into a table with VOID in an array/map.
  • We can't write data into a table with only VOID columns, nothing else.
  • We can't write data into a table that has a struct with only VOID fields, nothing else.

Note that for these cases, the write operation should fail with a proper error, but we can still perform metadata-only operations or read from such tables. For example, it's possible to create such an empty table, and then evolve the type with schema evolution or add other columns to fix the schema.

How does this behavior sound to you? Do you think you can change your PR to behave this way for alignment with how Delta Spark connector will behave? From what I understand, it seems originally your PR had a similar behavior, so hopefully it is not too much work.

@cdelmonte-zg
Author


Thank you @ZiyaZa for the detailed context on the upcoming Spark connector changes, this is extremely helpful for getting the semantics right.

The approach you describe makes a lot of sense:

  • Keep void columns visible in reads, generating null values on the fly (since void is absent from Parquet but defined in the Delta schema)
  • Only validate void-in-array/map, all-void structs, and all-void tables at write time
  • Ensure reads and metadata-only operations always succeed

My PR currently mirrors the existing dropNullTypeColumns behavior for consistency, but aligning with where Spark is heading is clearly the better long-term direction, especially since dropping columns on reads is technically incorrect (the column does exist in the table schema).

I'm happy to rework the PR accordingly. Concretely, this would mean:

  1. Remove the recursive void field drop during scan
  2. Keep void fields in the logical schema: in my understanding, the existing scan transform mechanism (the same one used for partition columns) should naturally generate null values for void columns missing from Parquet files
  3. Move the array/map/all-void validation from schema load time to the write path
  4. Update tests to verify that void columns appear in read results as null columns

Before proceeding with the rework, I'll wait for @emkornfield to confirm the direction so we're all aligned.
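Point 2 above, synthesizing nulls for columns that exist in the Delta schema but are absent from Parquet, can be sketched with toy types (Column here is a stand-in for an Arrow array; the kernel's actual transform mechanism and signatures differ):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int(Vec<Option<i32>>),
    Null(usize), // like Arrow's NullArray: only a length, every value is null
}

/// Assemble a batch in logical-schema order: columns present in the
/// physical (Parquet) data are passed through, and columns missing from
/// it (e.g. void columns, which are never written) come back as all-null.
fn read_batch(
    logical_schema: &[&str],
    physical: &HashMap<String, Column>,
    num_rows: usize,
) -> Vec<(String, Column)> {
    logical_schema
        .iter()
        .map(|name| {
            let col = physical
                .get(*name)
                .cloned()
                .unwrap_or(Column::Null(num_rows));
            (name.to_string(), col)
        })
        .collect()
}

fn main() {
    let mut physical = HashMap::new();
    physical.insert("id".to_string(), Column::Int(vec![Some(1), Some(2), Some(3)]));
    // "void_col" is in the Delta schema but was never written to Parquet.
    let batch = read_batch(&["id", "void_col"], &physical, 3);
    println!("{:?}", batch);
}
```

This is the same shape of mechanism the scan transform already uses for partition columns: the value is not in the file, so it is synthesized at read time.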

@emkornfield
Collaborator

@cdelmonte-zg yes, this approach also seems like the right thing to do. Appreciate your flexibility here.

cdelmonte-zg added a commit to cdelmonte-zg/delta-kernel-rs that referenced this pull request Mar 1, 2026
Mirror Spark connector behavior for void type handling:
  - Drop void columns at scan time so consumers never see them
  - Reject void nested inside Array/Map with schema validation error
  - Add void_utils module with early-exit visitor for validation
  - Wire validation into TableConfiguration on schema load
  - Add integration tests with a Delta table containing a void column

Addresses review feedback in delta-io#1858.

Closes delta-io#739
@github-actions github-actions Bot removed the breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump. label Mar 1, 2026
@cdelmonte-zg cdelmonte-zg reopened this Mar 1, 2026
@github-actions github-actions Bot added the breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump. label Mar 1, 2026
@cdelmonte-zg
Author

Updated based on all review feedback. Key changes since last review:

  • Void columns stay visible on reads (null-on-read, per @ZiyaZa's recommendation)
  • Write-time validation only (void-in-array/map, all-void struct/table)
  • Recursive void stripping in both physical schema and logical-to-physical Transform
  • Added tests for container types, NullArray edge cases, nested void in structs

Ready for re-review.

Comment thread kernel/src/engine/arrow_conversion.rs Outdated
Ok(())
}

// Fokko review: void is inherently always-null, so nullable=false is semantically
Collaborator


Lets not leave peoples names in the comments.

Author

@cdelmonte-zg cdelmonte-zg Mar 6, 2026


Fixed: removed all person names from code comments.

// Arrow conversion still works — produces Null type
let arrow_field = ArrowField::try_from_kernel(&field)?;
assert_eq!(arrow_field.data_type(), &ArrowDataType::Null);
assert!(!arrow_field.is_nullable());
Collaborator


i assume we have a test at a higher level in delta that will error on this?

Author


This is tied to the nullable=false question below (line 2211). For now we tolerate it on reads. If we decide to reject it, we'll add validation and update this test accordingly.

Collaborator


Let's add a comment there then?

Collaborator


Just watch out -- arrow's NullArray::is_null method always returns false. Have to request Array::logical_nulls instead if you need accurate nulls. Several other data types fall in the same category. I believe this is the first time we would introduce such a type into kernel, so it could break all kinds of null checks.

Author


Done: added comment explaining why we tolerate nullable=false for void on reads.

Good to know, thanks. In this PR we don't rely on is_null() for NullArray, we check the Arrow type directly (ArrowDataType::Null). But worth keeping in mind if future code needs null checks on this type.

fn test_void_field_in_struct() -> DeltaResult<()> {
// A struct schema with a void column should convert to Arrow with a Null field
let schema = StructType::try_new([
StructField::nullable("id", DataType::INTEGER),
Collaborator


any reason integer column is included here?

Author


The integer column is intentional. It verifies that void works correctly alongside normal types in a struct, which is the realistic use case. The isolated void roundtrip is covered by test_void_type_roundtrip.


ZiyaZa commented Mar 23, 2026

Is there any harm to writing out the Delta VOID column as parquet UNKNOWN though? Delta's read path would be specified to not even try to read it from parquet, but downstream readers who rely on the parquet schema would at least have something to work with?

I believe it can cause crashes for readers that can't deal with UNKNOWN correctly. Parquet spec has UNKNOWN since 2021 (and NULL even before that), but it's not widely supported as far as I know. Parquet-java only added it in 2025. Spark only started supporting UNKNOWN with my PR there recently.

If we write it out as UNKNOWN, then for reads:

  • We can probably make Kernel reads work with some improvements to this PR if needed
    • Do we need to worry about how old Kernel versions would behave if there is UNKNOWN in the file?
  • The Spark connector will attempt to read NullType from the data files if it is there (because it will rely on the Spark Parquet reader automatically generating null values for columns that are missing from the file). I believe this will still work if we write it out as UNKNOWN.
    • I'm not sure how the old versions of the Spark connector would behave. The reads that involve NullType are already super broken, so perhaps it won't break any more than that. The error message might change.
  • Other external reader implementations: Here I've no idea what's going to happen. Depending on the implementation specifics, it can maybe crash or maybe continue working.

So the only harm that I can think of is if there is some other reader that doesn't understand UNKNOWN and will start crashing with it being present in the file.

Collaborator

scovich commented Mar 23, 2026

Parquet spec has UNKNOWN since 2021 (and NULL even before that)

I couldn't find anything about a NULL type in the parquet spec?
i.e. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

And Google AI overview says parquet doesn't have such a type?

Collaborator

scovich commented Mar 23, 2026

the only harm that I can think of is if there is some other reader that doesn't understand UNKNOWN and will start crashing with it being present in the file.

Wouldn't such a reader anyway not understand the Delta VOID type in the first place, and fail during analysis? Long before execution where parquet becomes relevant?


ZiyaZa commented Mar 23, 2026

I couldn't find anything about a NULL type in the parquet spec? i.e. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

And Google AI overview says parquet doesn't have such a type?

I think UNKNOWN replaced NULL, looking at the git history, see apache/parquet-format@4bddbad


ZiyaZa commented Mar 23, 2026

Wouldn't such a reader anyway not understand the Delta VOID type in the first place, and fail during analysis? Long before execution where parquet becomes relevant?

I don't think so, because VOID and UNKNOWN are kind of different. In the Delta Protocol, it doesn't say anything regarding UNKNOWN type, and it is very vague when mentioning void type:

Note: Existing tables may have void data type columns. Behavior is undefined for void data type columns but it is recommended to drop any void data type columns on reads (as is implemented by the Spark connector).

So, from this, readers must understand the VOID type, because it is in the protocol. But they may or may not understand UNKNOWN. If we write files with UNKNOWN, we require readers to know how to read it from Parquet files.

You could also argue that "Behavior is undefined" means queries involving VOID can also crash, so breaking readers is not a problem.

When I make the changes in Spark connector, I'll also make some changes to Protocol to specify how VOID needs to be handled. But in the current state, it's all very vague and can be understood both ways.

@cdelmonte-zg
Author

@emkornfield Hi, a gentle ping on this PR. I've addressed the requested changes and simplified the tests as suggested. Would appreciate a re-review when you have time. Thanks!

Collaborator

scovich commented Apr 22, 2026

Reviews aside, it looks like there are significant merge conflicts that could be resolved?

cdelmonte-zg added a commit to cdelmonte-zg/delta-kernel-rs that referenced this pull request Apr 23, 2026
Mirror Spark connector behavior for void type handling:
  - Drop void columns at scan time so consumers never see them
  - Reject void nested inside Array/Map with schema validation error
  - Add void_utils module with early-exit visitor for validation
  - Wire validation into TableConfiguration on schema load
  - Add integration tests with a Delta table containing a void column

Addresses review feedback in delta-io#1858.

Closes delta-io#739
@cdelmonte-zg
Author

Done, rebased onto main, ready for review.

cdelmonte-zg added a commit to cdelmonte-zg/delta-kernel-rs that referenced this pull request Apr 23, 2026
Mirror Spark connector behavior for void type handling:
  - Drop void columns at scan time so consumers never see them
  - Reject void nested inside Array/Map with schema validation error
  - Add void_utils module with early-exit visitor for validation
  - Wire validation into TableConfiguration on schema load
  - Add integration tests with a Delta table containing a void column

Addresses review feedback in delta-io#1858.

Closes delta-io#739
@github-actions

PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: support void primitive type in schema

@cdelmonte-zg cdelmonte-zg changed the title feat: support void primitive type in schema feat!: support void primitive type in schema Apr 23, 2026
Add support for the `void` data type as defined in the Delta protocol. Existing tables may have void columns; the kernel now parses them and maps them to Arrow's Null type.

Closes delta-io#739
Mirror Spark connector behavior for void type handling:
  - Drop void columns at scan time so consumers never see them
  - Reject void nested inside Array/Map with schema validation error
  - Add void_utils module with early-exit visitor for validation
  - Wire validation into TableConfiguration on schema load
  - Add integration tests with a Delta table containing a void column

Addresses review feedback in delta-io#1858.

Closes delta-io#739
- Replace top-level-only void filter with recursive SchemaTransform
    that drops void fields at any struct nesting level (Spark parity)
  - Add drop_void_fields() using SchemaTransform in void_utils
  - Add unit tests for nested and deeply nested void drop
  - Add integration test for void inside nested struct
  - Add test for explicit projection with void column
  - Cover parquet row group skipping Void branches (min/max stats)
  - Cover arrow expression Void => NullBuilder branch
…time

Rework void type handling per ZiyaZa/emkornfield review:
  - Reads return void columns as null (no longer dropped from schema)
  - Write-time validation rejects void in Array/Map, all-void structs/tables
  - Metadata-only operations always succeed regardless of void placement
  - Physical schema recursively strips void fields (including nested structs)
  - Fix VOID↔Null mapping in ensure_data_types for expression evaluation

  Addresses Fokko's inline reviews:
  - Add test for nullable=false + void (tolerated on reads)
  - Document unreachable void stats branches, remove artificial tests
…time

Rework void type handling per ZiyaZa/emkornfield review:
  - Reads return void columns as null (no longer dropped from schema)
  - Write-time validation rejects void in Array/Map, all-void structs/tables
  - Metadata-only operations always succeed regardless of void placement
  - Physical schema recursively strips void fields (including nested structs)
  - Fix VOID↔Null mapping in ensure_data_types for expression evaluation
  - Fix NullArray crash in compute_nested_null_masks (null buffer propagation)

  Addresses Fokko's inline reviews:
  - Add test for nullable=false + void (tolerated on reads)
  - Document unreachable void stats branches, remove artificial tests
…n void assertions

Add a targeted unit test for compute_nested_null_masks with a NullArray
  child and non-empty parent nulls, verifying the early-return fix doesn't
  regress. Strengthen void-in-array and void-in-map read tests to assert
  null_count() == len() (entire column is NULL, not just nullable). Refine
  comments to clarify semantics: columns absent from the Parquet schema are
  synthesized as null columns at read time.
…unt quirk

In the compute_nested_null_masks unit test, assert that NullArray reports
  null_count()==0 (Arrow has no null bitmap for NullArray) and document the
  quirk. In void-in-array and void-in-map read tests, drop incorrect
  DataType::Null assertions — synthesized columns preserve the container
  type (List<Null>, Map<String,Null>) matching the Delta schema, not a bare
  NullArray — and clarify this in comments.
…n-map columns

Verify that void-in-array materializes as List<Null> and void-in-map as
  Map<Utf8,Null>, not a bare NullArray or a different container. Guards
  against regressions where the synthesized column type silently changes.
generate_logical_to_physical() only dropped top-level void fields, while
  the physical schema stripped them recursively via strip_void_from_field.
  This mismatch would cause the expression evaluator to pass void sub-fields
  that the physical schema did not expect. Add build_void_stripping_transform()
  to construct nested Transform expressions that drop void at all struct
  nesting levels.

  Also add: unit tests for strip_void_from_field (passthrough, mixed void,
  depth 3, metadata preservation), IS NOT NULL predicate on void column,
  write_rejects_void_in_map_key integration test, and E2E test verifying
  the transform drops nested void fields.
Separate the void-detection pre-scan from the Transform construction
  in build_void_stripping_transform(). The new contains_void() does a
  pure boolean traversal without allocating intermediate Transforms,
  reducing cost on deeply nested struct schemas.
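The pre-scan described in this commit can be sketched as a pure boolean traversal over toy schema types (contains_void here is illustrative, not the kernel's actual signature):

```rust
#[derive(Debug)]
enum DataType {
    Integer,
    Void,
    Struct(Vec<DataType>),
    Array(Box<DataType>),
    Map(Box<DataType>, Box<DataType>),
}

/// Pure boolean traversal: answers "is there a void anywhere in here?"
/// without allocating any replacement schema or Transform nodes.
fn contains_void(dt: &DataType) -> bool {
    match dt {
        DataType::Void => true,
        DataType::Integer => false,
        DataType::Struct(children) => children.iter().any(contains_void),
        DataType::Array(elem) => contains_void(elem),
        DataType::Map(key, value) => contains_void(key) || contains_void(value),
    }
}

fn main() {
    let schema = DataType::Struct(vec![
        DataType::Integer,
        DataType::Struct(vec![DataType::Void]),
    ]);
    // Only when this returns true does the caller pay for building a Transform.
    println!("{}", contains_void(&schema));
}
```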
- Remove person names from code comments
- Enforce strict empty-string parsing for void scalars
- Use top-level import for validate_schema_for_write
- Fix Parquet comment (void not written by Delta, not a Parquet limitation)
- Name the error_message field in CheckVoidInComplexTypes
- Refactor StripVoidFields and CheckAllVoidStructs to use
SchemaTransform (local structs, following MakePhysical pattern)
- Unify "all fields are void" error message for top-level and nested
- Fix read test comment (void columns are read as null, not dropped)
- Parametrize void-in-complex-type validation tests with rstest
Address review feedback by parametrizing void_utils tests with rstest
(validation, write rejection, and strip_void cases) and extracting a
scan_with_void_schema helper to reduce duplicated setup in read tests.
With void type now supported in schema deserialization, the _snapshot
workloads in void_001..void_007 pass. The _read_all workloads still fail
due to a separate Arrow/parquet reader limitation (cannot annotate
Unknown logical type from BOOLEAN physical type), so narrow the expected
failure patterns accordingly.

Labels

breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump.


Development

Successfully merging this pull request may close these issues.

Support reading schema string with void type
