Skip to content

feat: Add arrow conversion support in default engine#2496

Open
lorenarosati wants to merge 15 commits intodelta-io:mainfrom
lorenarosati:stack/geo-arrow
Open

feat: Add arrow conversion support in default engine#2496
lorenarosati wants to merge 15 commits intodelta-io:mainfrom
lorenarosati:stack/geo-arrow

Conversation

@lorenarosati
Copy link
Copy Markdown
Collaborator

@lorenarosati lorenarosati commented Apr 29, 2026

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

This PR adds the following:

  • Conversion from kernel to arrow - GeoArrow tagged fields + binary physical Arrow type for geo
  • ParseJson workaround code to parse WKT stats into WKB

How was this change tested?

  • WKT to WKB conversion preserves GeoArrow extension data
  • Non-default CRS (EPSG:4326) survives the WKT→WKB pipeline end-to-end
  • Geography edges payload (e.g. Vincenty) survives the WKT→WKB pipeline end-to-end
  • JSON parsing on schemas with no geo fields short-circuits and returns the input schema unchanged
  • Nullability coercion on a geo-target field attaches the geoarrow extension metadata while preserving the source field's other metadata (e.g. parquet field IDs)
  • The transform path in expression evaluation attaches geoarrow extension metadata + CRS when applying a Geometry target field, instead of producing plain Binary

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 29, 2026

PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: temp
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: temp
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: temp
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: temp
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add geo scalars and arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add geo scalars and arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add geo scalars and arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add geo scalars and arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add geo scalars and arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add geo scalars and arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add geo scalars and arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add arrow conversion support in default engine
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.

Your title should start with feat:, fix:, chore:, docs:, perf:, refactor:, test:, or ci:, and if it's a breaking change that should be suffixed with a ! (like feat!:), and then a 1-72 character brief description of your change.

Title: feat: Add arrow conversion support in default engine

@github-actions github-actions Bot added the breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump. label Apr 29, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.50%. Comparing base (6486bd2) to head (10b79aa).
⚠️ Report is 12 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/schema/mod.rs 89.88% 10 Missing and 7 partials ⚠️
kernel/src/transaction/stats_verifier.rs 0.00% 6 Missing ⚠️
kernel/src/expressions/scalars.rs 0.00% 3 Missing ⚠️
kernel/src/engine/parquet_row_group_skipping.rs 0.00% 2 Missing ⚠️
ffi/src/expressions/kernel_visitor.rs 0.00% 1 Missing ⚠️
ffi/src/schema.rs 0.00% 1 Missing ⚠️
kernel/src/engine/arrow_conversion/mod.rs 0.00% 1 Missing ⚠️
kernel/src/engine/arrow_expression/mod.rs 0.00% 1 Missing ⚠️
test-utils/src/table_builder.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2496      +/-   ##
==========================================
+ Coverage   88.40%   88.50%   +0.09%     
==========================================
  Files         178      179       +1     
  Lines       58235    59313    +1078     
  Branches    58235    59313    +1078     
==========================================
+ Hits        51485    52495    +1010     
- Misses       4783     4800      +17     
- Partials     1967     2018      +51     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lorenarosati lorenarosati changed the title temp feat: Add geo scalars and arrow conversion support in default engine Apr 30, 2026
@lorenarosati lorenarosati marked this pull request as ready for review May 1, 2026 23:31
@lorenarosati lorenarosati changed the title feat: Add geo scalars and arrow conversion support in default engine feat: Add arrow conversion support in default engine May 1, 2026
@lorenarosati lorenarosati requested a review from OussamaSaoudi May 1, 2026 23:45
@lorenarosati
Copy link
Copy Markdown
Collaborator Author

Range-diff: stack/schema-table-feat-geo (b57aaab -> e31c15f)
kernel/src/engine/arrow_expression/apply_schema.rs
@@ -43,4 +43,50 @@
 +            };
              Ok((transformed_field, transformed_col))
          });
-     let (transformed_fields, transformed_cols): (Vec<ArrowField>, Vec<ArrayRef>) =
\ No newline at end of file
+     let (transformed_fields, transformed_cols): (Vec<ArrowField>, Vec<ArrayRef>) =
+         );
+     }
+ 
++    #[cfg(feature = "arrow-conversion")]
++    #[test]
++    fn test_apply_schema_attaches_geoarrow_metadata_for_geo_field() {
++        use crate::arrow::array::BinaryArray;
++        use crate::schema::{GeometryType, PrimitiveType};
++
++        let target_schema = StructType::new_unchecked([StructField::nullable(
++            "geom",
++            DataType::Primitive(PrimitiveType::Geometry(Box::new(
++                GeometryType::try_new("EPSG:4326").unwrap(),
++            ))),
++        )]);
++
++        let arrow_field = ArrowField::new("geom", ArrowDataType::Binary, true);
++        let input_array = StructArray::try_new(
++            vec![arrow_field].into(),
++            vec![Arc::new(BinaryArray::from(vec![Some(
++                b"\x01\x01\x00\x00\x00".as_ref(),
++            )]))],
++            None,
++        )
++        .unwrap();
++
++        let result = apply_schema_to_struct(&input_array, &target_schema).unwrap();
++
++        let (_, output_field) = result.fields().find("geom").unwrap();
++        assert_eq!(
++            output_field
++                .metadata()
++                .get("ARROW:extension:name")
++                .map(String::as_str),
++            Some("geoarrow.wkb"),
++        );
++        let ext_meta = output_field
++            .metadata()
++            .get("ARROW:extension:metadata")
++            .expect("geo target should produce CRS extension metadata");
++        assert!(ext_meta.contains("EPSG:4326"));
++    }
++
+     /// Test that apply_schema succeeds when the input Arrow field already carries the same field
+     /// ID as the target kernel schema field (no conflict).
+     #[test]
\ No newline at end of file
kernel/src/engine/arrow_utils.rs
@@ -69,6 +69,62 @@
          return Ok(batch);
      }
      Err(Error::generic(
+         assert!(map_val.is_nullable());
+     }
+ 
++    #[cfg(feature = "arrow-conversion")]
++    #[test]
++    fn test_coerce_batch_nullability_geo_target_attaches_extension() {
++        use std::collections::HashMap;
++
++        use crate::geoarrow_schema::{GeoArrowType, Metadata, WkbType};
++
++        let src_field = ArrowField::new("geom", ArrowDataType::Binary, true)
++            .with_metadata(HashMap::from([(
++                "parquet.field.id".to_string(),
++                "1".to_string(),
++            )]));
++        let src_array: Arc<dyn ArrowArray> = Arc::new(crate::arrow::array::BinaryArray::from(
++            vec![Some(b"\x01\x01\x00\x00\x00".as_ref())],
++        ));
++        let src_schema = Arc::new(ArrowSchema::new(vec![src_field]));
++        let batch = RecordBatch::try_new(src_schema, vec![src_array]).unwrap();
++
++        let crs = crate::geoarrow_schema::crs::Crs::from_srid("EPSG:4326".to_string());
++        let target_field =
++            GeoArrowType::Wkb(WkbType::new(Arc::new(Metadata::new(crs, None)))).to_field("geom", true);
++        let target_schema = Arc::new(ArrowSchema::new(vec![target_field]));
++
++        let result = coerce_batch_nullability(batch, &target_schema, None).unwrap();
++        let out_field = result.schema().field(0).clone();
++
++        assert_eq!(
++            out_field
++                .metadata()
++                .get("ARROW:extension:name")
++                .map(String::as_str),
++            Some("geoarrow.wkb"),
++        );
++        let ext_meta = out_field
++            .metadata()
++            .get("ARROW:extension:metadata")
++            .expect("CRS metadata should be present");
++        assert!(ext_meta.contains("EPSG:4326"));
++
++        // `with_extension_type` preserves source metadata; the prior `target_field.clone()`
++        // path would have dropped this.
++        assert_eq!(
++            out_field
++                .metadata()
++                .get("parquet.field.id")
++                .map(String::as_str),
++            Some("1"),
++        );
++    }
++
+     // --- Tests for build_json_reorder_indices and json_arrow_schema ---
+ 
+     const FILE_PATH: &str = "s3://bucket/test.json";
              ]
          );
      }
@@ -113,6 +169,26 @@
 +            "WKB bytes should be non-empty"
 +        );
 +
++        let geom_field = match batch.schema().field(0).data_type() {
++            ArrowDataType::Struct(f) => f[0].clone(),
++            other => panic!("expected Struct minValues, got {other:?}"),
++        };
++        assert_eq!(
++            geom_field
++                .metadata()
++                .get("ARROW:extension:name")
++                .map(String::as_str),
++            Some("geoarrow.wkb"),
++        );
++        let ext_meta = geom_field
++            .metadata()
++            .get("ARROW:extension:metadata")
++            .expect("CRS metadata should be present on output");
++        assert!(
++            ext_meta.contains("OGC:CRS84"),
++            "output CRS should be OGC:CRS84, got {ext_meta}"
++        );
++
 +        // Missing geom field produces a null entry
 +        let input: Vec<Option<&str>> = vec![Some(r#"{"minValues":{}}"#)];
 +        let batch = parse_json_impl(&input.into(), schema.clone()).unwrap();
@@ -248,4 +324,72 @@
 +        assert_eq!(binary.len(), 1);
 +        assert!(!binary.value(0).is_empty(), "WKB bytes should be non-empty");
 +    }
++
++    #[cfg(feature = "arrow-conversion")]
++    #[test]
++    fn test_parse_json_impl_preserves_crs_and_edges_metadata() {
++        use crate::geoarrow_schema::{Edges, GeoArrowType, Metadata, WkbType};
++
++        fn parse_one_geo_row(geo_type: GeoArrowType, field_name: &str) -> ArrowField {
++            let geo_field = geo_type.to_field(field_name, true);
++            let schema = Arc::new(ArrowSchema::new(vec![ArrowField::new(
++                "minValues",
++                ArrowDataType::Struct(ArrowFields::from(vec![geo_field])),
++                true,
++            )]));
++            let json = format!(r#"{{"minValues":{{"{field_name}":"POINT(1.0 2.0)"}}}}"#);
++            let input: Vec<Option<&str>> = vec![Some(&json)];
++            let batch = parse_json_impl(&input.into(), schema).unwrap();
++            match batch.schema().field(0).data_type() {
++                ArrowDataType::Struct(f) => f[0].as_ref().clone(),
++                other => panic!("expected Struct, got {other:?}"),
++            }
++        }
++
++        let crs = crate::geoarrow_schema::crs::Crs::from_srid("EPSG:4326".to_string());
++        let geometry_field = parse_one_geo_row(
++            GeoArrowType::Wkb(WkbType::new(Arc::new(Metadata::new(crs, None)))),
++            "geom",
++        );
++        let geometry_meta = geometry_field
++            .metadata()
++            .get("ARROW:extension:metadata")
++            .expect("Geometry output should have CRS metadata");
++        assert!(
++            geometry_meta.contains("EPSG:4326"),
++            "Geometry CRS should be EPSG:4326, got {geometry_meta}"
++        );
++
++        let crs = crate::geoarrow_schema::crs::Crs::from_srid("OGC:CRS84".to_string());
++        let geography_field = parse_one_geo_row(
++            GeoArrowType::Wkb(WkbType::new(Arc::new(Metadata::new(
++                crs,
++                Some(Edges::Vincenty),
++            )))),
++            "geog",
++        );
++        let geography_meta = geography_field
++            .metadata()
++            .get("ARROW:extension:metadata")
++            .expect("Geography output should have CRS metadata");
++        assert!(
++            geography_meta.contains("\"edges\":\"vincenty\""),
++            "Geography output should carry edges=vincenty, got {geography_meta}"
++        );
++    }
++
++    #[test]
++    fn test_parse_json_impl_no_op_when_schema_has_no_geo() {
++        let schema = Arc::new(ArrowSchema::new(vec![
++            ArrowField::new("id", ArrowDataType::Int64, true),
++            ArrowField::new("name", ArrowDataType::Utf8, true),
++        ]));
++
++        let input: Vec<Option<&str>> = vec![Some(r#"{"id":42,"name":"hello"}"#)];
++        let batch = parse_json_impl(&input.into(), schema.clone()).unwrap();
++
++        assert_eq!(*batch.schema(), *schema);
++        assert_eq!(batch.num_rows(), 1);
++        assert_eq!(batch.num_columns(), 2);
++    }
  }
\ No newline at end of file

Reproduce locally: git range-diff 10b79aa..b57aaab 10b79aa..e31c15f | Disable: git config gitstack.push-range-diff false

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant