feat: Add geo schema types and table feature#2464
feat: Add geo schema types and table feature#2464lorenarosati wants to merge 14 commits intodelta-io:mainfrom
Conversation
|
PR title does not match the required pattern. Please ensure you follow the conventional commits spec. Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: Your title should start with Title: |
|
PR title does not match the required pattern. Please ensure you follow the conventional commits spec. Your title should start with Title: |
Range-diff: main (b369f06 -> dc5d29e)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2464 +/- ##
==========================================
+ Coverage 88.40% 88.50% +0.09%
==========================================
Files 178 179 +1
Lines 58235 59313 +1078
Branches 58235 59313 +1078
==========================================
+ Hits 51485 52495 +1010
- Misses 4783 4800 +17
- Partials 1967 2018 +51 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Range-diff: main (dc5d29e -> b8a88e5)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
b8a88e5 to
dc5d29e
Compare
| _ => unreachable!(), | ||
| } | ||
| } | ||
| // Geometry/Geography are not valid partition column types, so there is no |
There was a problem hiding this comment.
Geo columns should not be partition columns
There was a problem hiding this comment.
Wonder if we want to detect this when we
- Create a transaction
- Create a table
?
There was a problem hiding this comment.
Write path considerations like this should be out of scope for now, and I checked and seems like geo follows the same level of checks that other types that can't be partition values do - they all error at parse_partition_value_raw (geo is a primitive so it'll go to parse_Scalar and error there)
There was a problem hiding this comment.
geo follows the same level of checks that other types that can't be partition values do -
Seems not totally same level of checks? Map, Array, Variants as partition columns are rejected in create table(validate_partition_columns) but geo types are not
| /// the `geospatial` feature in both reader and writer features. | ||
| pub(crate) fn validate_geospatial_feature_support(tc: &TableConfiguration) -> DeltaResult<()> { | ||
| let protocol = tc.protocol(); | ||
| if !protocol.has_table_feature(&TableFeature::GeospatialType) { |
There was a problem hiding this comment.
Mirrors checks and testing for variant and timestamp feature support: https://github.com/delta-io/delta-kernel-rs/blob/main/kernel/src/schema/variant_utils.rs, https://github.com/delta-io/delta-kernel-rs/blob/main/kernel/src/table_features/timestamp_ntz.rs
Range-diff: main (dc5d29e -> ea12da4)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
| Ok(Self { srid, algorithm }) | ||
| } | ||
|
|
||
| /// Creates a new GeographyType with the given SRID and the default edge interpolation |
There was a problem hiding this comment.
Mirrors constructors available in java kernel
| .map(PrimitiveType::Decimal) | ||
| .map_err(serde::de::Error::custom) | ||
| } | ||
| "geometry" => Ok(PrimitiveType::Geometry(GeometryType::default())), |
Range-diff: main (ea12da4 -> 14a713c)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
Range-diff: main (14a713c -> 68b8cd8)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
Range-diff: main (68b8cd8 -> 848b098)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
848b098 to
06499d0
Compare
Range-diff: main (06499d0 -> 3d37746)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
Range-diff: main (3d37746 -> aac9b0a)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
| #[serde(serialize_with = "serialize_decimal", untagged)] | ||
| Decimal(DecimalType), | ||
| #[serde(serialize_with = "serialize_geometry", untagged)] | ||
| Geometry(Box<GeometryType>), |
There was a problem hiding this comment.
Box b/c adding geo types made the Error enum exceed 128 bytes and failed clippy check (this solution matches existing DataType convention of boxing types)
Range-diff: main (aac9b0a -> 97281f0)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
| "geography" => Ok(PrimitiveType::Geography(Box::default())), | ||
| geo_str if geo_str.starts_with("geography(") && geo_str.ends_with(')') => { | ||
| let inner = &geo_str[10..geo_str.len() - 1]; | ||
| // Three accepted shapes: |
There was a problem hiding this comment.
Java kernel deserializes the same cases:
- geometry
- geography
- geometry with an SRID in brackets
- geography with SRID and alg in brackets
- geography with SRID in brackets
- geography with alg in brackets (differentiated from just SRID case by a regex pattern that checks for a colon)
There was a problem hiding this comment.
I wonder what's the case for
geography with SRID in brackets
geography with alg in brackets (differentiated from just SRID case by a regex pattern that checks for a colon)
As we only serialize to geography(srid, alg). Is java-kernel trying to be compatible with some other impl? If so, a comment would be very helpful!
There was a problem hiding this comment.
Good callout! I can leave this as a TODO to look into but this should be non-blocking for this PR
There was a problem hiding this comment.
Let's add a short comment just stating this is following the convention from kernel-java. So that if we re-visit it in the future we can know.
dengsh12
left a comment
There was a problem hiding this comment.
Looks good! Several comments/questions. Also, could we add integration tests to validate the write-and-read round path?
| _ => unreachable!(), | ||
| } | ||
| } | ||
| // Geometry/Geography are not valid partition column types, so there is no |
There was a problem hiding this comment.
Wonder if we want to detect this when we
- Create a transaction
- Create a table
?
| "geography" => Ok(PrimitiveType::Geography(Box::default())), | ||
| geo_str if geo_str.starts_with("geography(") && geo_str.ends_with(')') => { | ||
| let inner = &geo_str[10..geo_str.len() - 1]; | ||
| // Three accepted shapes: |
There was a problem hiding this comment.
I wonder what's the case for
geography with SRID in brackets
geography with alg in brackets (differentiated from just SRID case by a regex pattern that checks for a colon)
As we only serialize to geography(srid, alg). Is java-kernel trying to be compatible with some other impl? If so, a comment would be very helpful!
| #[test] | ||
| fn test_roundtrip_geometry() { | ||
| let data = r#" | ||
| { | ||
| "name": "g", | ||
| "type": "geometry(EPSG:4326)", | ||
| "nullable": true, | ||
| "metadata": {} | ||
| } | ||
| "#; | ||
| let field: StructField = serde_json::from_str(data).unwrap(); | ||
| assert_eq!( | ||
| field.data_type, | ||
| DataType::Primitive(PrimitiveType::Geometry(Box::new( | ||
| GeometryType::try_new("EPSG:4326").unwrap() | ||
| ))) | ||
| ); | ||
|
|
||
| let json_str = serde_json::to_string(&field).unwrap(); | ||
| assert_eq!( | ||
| json_str, | ||
| r#"{"name":"g","type":"geometry(EPSG:4326)","nullable":true,"metadata":{}}"# | ||
| ); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_roundtrip_geography() { | ||
| let data = r#" | ||
| { | ||
| "name": "g", | ||
| "type": "geography(EPSG:4326, vincenty)", | ||
| "nullable": true, | ||
| "metadata": {} | ||
| } | ||
| "#; | ||
| let field: StructField = serde_json::from_str(data).unwrap(); | ||
| assert_eq!( | ||
| field.data_type, | ||
| DataType::Primitive(PrimitiveType::Geography(Box::new( | ||
| GeographyType::try_new("EPSG:4326", EdgeInterpolationAlgorithm::Vincenty).unwrap() | ||
| ))) | ||
| ); | ||
|
|
||
| let json_str = serde_json::to_string(&field).unwrap(); | ||
| assert_eq!( | ||
| json_str, | ||
| r#"{"name":"g","type":"geography(EPSG:4326, vincenty)","nullable":true,"metadata":{}}"# | ||
| ); | ||
| } |
There was a problem hiding this comment.
NIT: Seems these two can be merged into test_geo_deserialize_defaults? We can rename test_geo_deserialize_defaults to test_geo_deserialize_succeed and assert the exact deserialized value for all cases
There was a problem hiding this comment.
Agreed. We can check round trip there
Range-diff: main (97281f0 -> 7546790)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
Range-diff: main (7546790 -> 6245673)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
Range-diff: main (6245673 -> 12c1bf6)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
| pub fn try_new(srid: &str) -> DeltaResult<Self> { | ||
| if srid.is_empty() { | ||
| return Err(Error::invalid_geometry("SRID cannot be empty")); | ||
| } | ||
| Ok(Self { | ||
| srid: srid.to_string(), | ||
| }) | ||
| } |
There was a problem hiding this comment.
This likely needs validation so that we're producing valid Geotypes.
There was a problem hiding this comment.
ig this needs validation?
There was a problem hiding this comment.
Two options for validation:
- Fixed set of valid SRID strings that we check against.
- use parse/understand geotypes. Possibly using geo crate?
If set is small, I'd recommend 1. else, we probably want 2.
There was a problem hiding this comment.
Discussed offline - documented future work in a comment!
There was a problem hiding this comment.
Chatted offline. This can be 100s of SRIDs. Let's make this a followup
There was a problem hiding this comment.
Keeping the comment as unresolved for future readers :)
Range-diff: main (12c1bf6 -> 10b79aa)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
| )) | ||
| )] | ||
| #[case( | ||
| "geography(EPSG:4326, vincenty)", |
There was a problem hiding this comment.
Can you add test cases for all the other interpolation algorithms?
| ); | ||
|
|
||
| // Geospatial is not supported for writes | ||
| let config = create_mock_table_config(&[], &[TableFeature::GeospatialType]); |
There was a problem hiding this comment.
There should be an equivalent test for ensure REad and CDF support. Could you add that?
| .map_err(serde::de::Error::custom) | ||
| } | ||
| None => { | ||
| let trimmed = inner.trim(); |
There was a problem hiding this comment.
hmm this would technically allow geography( OGC:CRS84 )
Maybe add a todo comment to reevaluate if that's an issue.
| /// | ||
| /// Returns `Err` if `srid` is empty. | ||
| pub fn try_new(srid: &str) -> DeltaResult<Self> { | ||
| // We only check that the SRID is non-empty; validating the value against the full set |
There was a problem hiding this comment.
CRS value can be specified in one of the following formats:
- A standard authority and identifier (`<authority>:<identifier>`), e.g.:
- `OGC:CRS84`
- `EPSG:3857`
- A custom definition, which can be provided in one of two ways:
- Using a Spatial Reference System Identifier (SRID), e.g. `srid:<number>`.
- Using a projjson reference to a table property where the projjson string is stored, e.g. `projjson:<tableProperty>`.
add validation that it's of form authority:identifier pls.
maybe tests for srid:, foo, :, and empty string, :CRS84
| )) | ||
| )] | ||
| #[case( | ||
| "geography(vincenty)", |
There was a problem hiding this comment.
is this a valid test? Protocol just says:
In the schema the geospatial types are serialized as:
geometry(<crs>)geography(<crs>, <algorithm>)
There was a problem hiding this comment.
I think we should just reject if single-element is not crs ==> geography(<algorithm>) not allowed
There was a problem hiding this comment.
This also removes default CRS above.
There was a problem hiding this comment.
Thanks for iterating! Agree with @OussamaSaoudi's comments, and one more concern on create table side, and as geo is supported by scan, could we include an integration test reading geo table using different combinations of CRS + algorithms?
| } | ||
| // Geometry/Geography are not valid partition column types, so there is no | ||
| // partition-value string format to parse here | ||
| // Kernel does not support parsing text into Geometry/Geography types |
There was a problem hiding this comment.
NIT: // Kernel does not support parsing text into Geometry/Geography types yet
| "geography" => Ok(PrimitiveType::Geography(Box::default())), | ||
| geo_str if geo_str.starts_with("geography(") && geo_str.ends_with(')') => { | ||
| let inner = &geo_str[10..geo_str.len() - 1]; | ||
| // Three accepted shapes: |
There was a problem hiding this comment.
Let's add a short comment just stating this is following the convention from kernel-java. So that if we re-visit it in the future we can know.
| r#"Feature 'typeWidening' is not supported for writes"#, | ||
| ); | ||
|
|
||
| // Geospatial is not supported for writes |
There was a problem hiding this comment.
NIT:
// Geospatial is not supported for writes yet
| _ => unreachable!(), | ||
| } | ||
| } | ||
| // Geometry/Geography are not valid partition column types, so there is no |
There was a problem hiding this comment.
geo follows the same level of checks that other types that can't be partition values do -
Seems not totally same level of checks? Map, Array, Variants as partition columns are rejected in create table(validate_partition_columns) but geo types are not
Range-diff: main (10b79aa -> 52c80b6)
... (truncated, output exceeded 60000 bytes) Reproduce locally: |
| pub(crate) fn is_skipping_eligible_datatype(data_type: &PrimitiveType) -> bool { | ||
| matches!( | ||
| data_type, | ||
| &PrimitiveType::Byte | ||
| | &PrimitiveType::Short | ||
| | &PrimitiveType::Integer | ||
| | &PrimitiveType::Long | ||
| | &PrimitiveType::Float | ||
| | &PrimitiveType::Double | ||
| | &PrimitiveType::Date | ||
| | &PrimitiveType::Timestamp | ||
| | &PrimitiveType::TimestampNtz | ||
| | &PrimitiveType::String | ||
| | PrimitiveType::Decimal(_) | ||
| | PrimitiveType::Geometry(_) | ||
| | PrimitiveType::Geography(_) |
There was a problem hiding this comment.
Should geo be eligible for data skipping?
🥞 Stacked PR
Use this link to review incremental changes.
Note: don't merge until the RFC is merged.
What changes are proposed in this pull request?
This PR implements the following:
How was this change tested?
All tests follow pre-existing test patterns for other types.