fix!: add container prefix support for default engine#2394
fix!: add container prefix support for default engine#2394Jameson-Crate wants to merge 1 commit intodelta-io:mainfrom
Conversation
b655e68 to
70f7595
Compare
Range-diff: main (b655e68 -> 70f7595)
Reproduce locally: |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2394 +/- ##
==========================================
+ Coverage 88.50% 88.54% +0.04%
==========================================
Files 179 179
Lines 59031 59383 +352
Branches 59031 59383 +352
==========================================
+ Hits 52245 52582 +337
- Misses 4785 4795 +10
- Partials 2001 2006 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
8a585e4 to
ea96e36
Compare
| /// | ||
| /// Prefer [`DefaultEngineBuilder::from_url`] when you don't already have a pre-built | ||
| /// object store -- it computes the store and prefix in one step. | ||
| pub fn new(object_store: Arc<DynObjectStore>, url_path_prefix: Path) -> Self { |
There was a problem hiding this comment.
This doesn't fit the builder pattern? The whole point of the pattern is to avoid adding a bunch of params to a constructor (especially those that are effectively optional). Instead, I think you should:
- Leave the existing
newconstructor as is (then you don't need to update every single caller to pass in empty path) - Replace
from_urlwithwith_url_prefixthat allows setting a prefix for the engine being built - When
build()is called, if no Path was set for url_prefix, construct the handlers with empty path
There was a problem hiding this comment.
What you are describing was the change I had initially, but I opted for this change instead because there is essentially no time that you should be passing in the object store you construct without passing in the path prefix, since these are both always a requirement it makes sense to have them both in the constructor of the builder rather than to have path prefix passed through a builder method.
With this change we would instead prefer that engines are constructed using from_url primarily and only use new when they want more control over anything relating to object store creation. The majority of connectors should fall into the first bucket and won't need to worry about path prefixes or object stores at all and the connectors that do want more control over the object store and what happens before it is passed to engine have the option but are now forced to consider path prefixes.
There was a problem hiding this comment.
because there is essentially no time that you should be passing in the object store you construct without passing in the path prefix
The majority of connectors should fall into the first bucket and won't need to worry about path prefixes
I think these two statements are contradictory? Based on the current usage pattern, most of the time you are passing in empty path prefix? So to me that indicates we should just have a default of path_prefix = "" - which is what the builder pattern is for?
There was a problem hiding this comment.
Handled Offline -- decided to move forward by keeping path_prefix as a required argument for the DefaultEngineBuilder constructor while coupling the store and path prefix into a single type which shows they are related
| let prefix_parts: Vec<_> = url_path_prefix.parts().collect(); | ||
| let full_parts: Vec<_> = full.parts().collect(); | ||
| if full_parts.len() < prefix_parts.len() || full_parts[..prefix_parts.len()] != prefix_parts[..] | ||
| { | ||
| return Err(DeltaError::generic(format!( | ||
| "URL path '{}' does not start with expected prefix '{}'", | ||
| full.as_ref(), | ||
| url_path_prefix.as_ref(), | ||
| ))); | ||
| } | ||
| Ok(Path::from_iter( | ||
| full_parts[prefix_parts.len()..].iter().cloned(), | ||
| )) |
There was a problem hiding this comment.
Is there a reason you can't use Path::prefix_match for this? https://docs.rs/object_store/latest/object_store/path/struct.Path.html#method.prefix_match
There was a problem hiding this comment.
great callout
| let path = Path::parse(url.path())?; | ||
| Ok((Box::new(store), path)) | ||
| } | ||
| #[rstest] |
There was a problem hiding this comment.
Do we have (or is it possible to add) tests at the store_from_url level? These current tests do not actually validate the scheme-specific behavior, so if our assumptions about how the scheme works is wrong then we would still have bugs. i.e are we sure that object_store::arse_url_opts actually behaves the way we expect for azure vs s3 urls?
There was a problem hiding this comment.
I had set up a small azure container instance with the format that was broken and used delta kernel rs to create a table and list from the table (results of which are in the screenshots in the pr description). It would be difficult to add this test to our unit / integration testing because it relies on a running azure instance, but I could certainly create a mock object store which is able to test store_from_url through more thorough unit testing.
|
Also the issue linked in your PR description is wrong - I assume it should be this one instead? #2209 |
good callout, thanks |
## What changes are proposed in this pull request? This release has all main branch commits up to Wed Apr 29 17:13:32 2026, with SHA= 14e289f.
ea96e36 to
88e33ca
Compare
What changes are proposed in this pull request?
Addresses #2209 by computing and storing the container path prefix during object store creation. The prefix is stored as an object store path which is empty in the majority of cases and a single path segment in other cases (although this implementation supports container prefixes of arbitrary length). This object store prefix is then plumbed through the default engine to each of the respective default handlers.
Going forward connectors will use a similar flow to the one exercised today, but instead of returning a store,
store_from_urlwill return a new structParsedStorewhich contains both of the object store and the path prefix. Default engine will now take aParsedStoreobject.How was this change tested?
Added new unit tests and passing all of the previous tests. Additionally ran integration testing with a
https://*.blob.core.windows.net/URL. For this test I set up an azure instance with the error-prone url type and tested creating a snapshot from it before and after the change. Before the change the listing shows there are no log files because the path is incorrect, after kernel is able to correctly list from the table.Before Fix:

After Fix:

Note to Reviewers
This change looks very large, but many of the updates are purely mechanical. The most critical changes live in three files:
kernel/src/engine/default/storage.rs,kernel/src/engine/default/filesystem.rs, andkernel/src/engine/default/mod.rs.kernel/src/engine/default/json.rsandkernel/src/engine/default/parquet.rsalso contain important changes, but are not as large as the first three files. It is also worth paying some attention to the FFI changes inffi/src/lib.rs:get_default_engine_impl.