Skip to content

fix!: add container prefix support for default engine#2394

Draft
Jameson-Crate wants to merge 1 commit intodelta-io:mainfrom
Jameson-Crate:stack/azure-https-url
Draft

fix!: add container prefix support for default engine#2394
Jameson-Crate wants to merge 1 commit intodelta-io:mainfrom
Jameson-Crate:stack/azure-https-url

Conversation

@Jameson-Crate
Copy link
Copy Markdown
Collaborator

@Jameson-Crate Jameson-Crate commented Apr 14, 2026

What changes are proposed in this pull request?

Addresses #2209 by computing and storing the container path prefix during object store creation. The prefix is stored as an object store path which is empty in the majority of cases and a single path segment in other cases (although this implementation supports container prefixes of arbitrary length). This object store prefix is then plumbed through the default engine to each of the respective default handlers.

Going forward connectors will use a similar flow to the one exercised today, but instead of returning a store, store_from_url will return a new struct ParsedStore which contains both of the object store and the path prefix. Default engine will now take a ParsedStore object.

How was this change tested?

Added new unit tests and passing all of the previous tests. Additionally ran integration testing with a https://*.blob.core.windows.net/ URL. For this test I set up an azure instance with the error-prone url type and tested creating a snapshot from it before and after the change. Before the change the listing shows there are no log files because the path is incorrect, after kernel is able to correctly list from the table.

Before Fix:
image

After Fix:
image

Note to Reviewers

This change looks very large, but many of the updates are purely mechanical. The most critical changes live in three files: kernel/src/engine/default/storage.rs, kernel/src/engine/default/filesystem.rs, and kernel/src/engine/default/mod.rs. kernel/src/engine/default/json.rs and kernel/src/engine/default/parquet.rs also contain important changes, but are not as large as the first three files. It is also worth paying some attention to the FFI changes in ffi/src/lib.rs:get_default_engine_impl.

@github-actions github-actions Bot added the breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump. label Apr 14, 2026
@Jameson-Crate Jameson-Crate force-pushed the stack/azure-https-url branch from b655e68 to 70f7595 Compare April 14, 2026 22:20
@Jameson-Crate
Copy link
Copy Markdown
Collaborator Author

Range-diff: main (b655e68 -> 70f7595)
kernel/src/engine/default/parquet.rs
@@ -208,4 +208,11 @@
 +        );
  
          let data = Box::new(ArrowEngineData::new(
-             RecordBatch::try_from_iter(vec![(
\ No newline at end of file
+             RecordBatch::try_from_iter(vec![(
+         let parquet_handler: Arc<dyn ParquetHandler> = Arc::new(DefaultParquetHandler::new(
+             store.clone(),
+             Arc::new(TokioBackgroundExecutor::new()),
++            Path::from(""),
+         ));
+ 
+         let engine_data: Box<dyn EngineData> = Box::new(ArrowEngineData::new(
\ No newline at end of file

Reproduce locally: git range-diff 975deb1..b655e68 975deb1..70f7595 | Disable: git config gitstack.push-range-diff false

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 92.30769% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.54%. Comparing base (78670eb) to head (88e33ca).

Files with missing lines Patch % Lines
kernel/src/engine/default/storage.rs 89.23% 8 Missing and 6 partials ⚠️
kernel/src/engine/default/filesystem.rs 92.68% 0 Missing and 6 partials ⚠️
kernel/src/engine/default/parquet.rs 86.04% 0 Missing and 6 partials ⚠️
delta-kernel-unity-catalog/src/lib.rs 33.33% 4 Missing ⚠️
kernel/src/snapshot/mod.rs 91.30% 4 Missing ⚠️
kernel/src/engine/default/json.rs 91.17% 0 Missing and 3 partials ⚠️
test-utils/src/lib.rs 80.00% 0 Missing and 2 partials ⚠️
ffi/src/lib.rs 96.15% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2394      +/-   ##
==========================================
+ Coverage   88.50%   88.54%   +0.04%     
==========================================
  Files         179      179              
  Lines       59031    59383     +352     
  Branches    59031    59383     +352     
==========================================
+ Hits        52245    52582     +337     
- Misses       4785     4795      +10     
- Partials     2001     2006       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Jameson-Crate Jameson-Crate force-pushed the stack/azure-https-url branch 3 times, most recently from 8a585e4 to ea96e36 Compare April 17, 2026 23:08
@Jameson-Crate Jameson-Crate requested a review from nicklan April 20, 2026 17:17
@Jameson-Crate Jameson-Crate marked this pull request as ready for review April 20, 2026 17:19
@Jameson-Crate Jameson-Crate requested a review from kyli87 April 22, 2026 00:20
Comment thread kernel/src/engine/default/mod.rs Outdated
///
/// Prefer [`DefaultEngineBuilder::from_url`] when you don't already have a pre-built
/// object store -- it computes the store and prefix in one step.
pub fn new(object_store: Arc<DynObjectStore>, url_path_prefix: Path) -> Self {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't fit the builder pattern? The whole point of the pattern is to avoid adding a bunch of params to a constructor (especially those that are effectively optional). Instead, I think you should:

  1. Leave the existing new constructor as is (then you don't need to update every single caller to pass in empty path)
  2. Replace from_url with with_url_prefix that allows setting a prefix for the engine being built
  3. When build() is called, if no Path was set for url_prefix, construct the handlers with empty path

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you are describing was the change I had initially, but I opted for this change instead because there is essentially no time that you should be passing in the object store you construct without passing in the path prefix, since these are both always a requirement it makes sense to have them both in the constructor of the builder rather than to have path prefix passed through a builder method.

With this change we would instead prefer that engines are constructed using from_url primarily and only use new when they want more control over anything relating to object store creation. The majority of connectors should fall into the first bucket and won't need to worry about path prefixes or object stores at all and the connectors that do want more control over the object store and what happens before it is passed to engine have the option but are now forced to consider path prefixes.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because there is essentially no time that you should be passing in the object store you construct without passing in the path prefix

The majority of connectors should fall into the first bucket and won't need to worry about path prefixes

I think these two statements are contradictory? Based on the current usage pattern, most of the time you are passing in empty path prefix? So to me that indicates we should just have a default of path_prefix = "" - which is what the builder pattern is for?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled Offline -- decided to move forward by keeping path_prefix as a required argument for the DefaultEngineBuilder constructor while coupling the store and path prefix into a single type which shows they are related

Comment thread kernel/src/engine/default/storage.rs Outdated
Comment on lines +161 to +173
let prefix_parts: Vec<_> = url_path_prefix.parts().collect();
let full_parts: Vec<_> = full.parts().collect();
if full_parts.len() < prefix_parts.len() || full_parts[..prefix_parts.len()] != prefix_parts[..]
{
return Err(DeltaError::generic(format!(
"URL path '{}' does not start with expected prefix '{}'",
full.as_ref(),
url_path_prefix.as_ref(),
)));
}
Ok(Path::from_iter(
full_parts[prefix_parts.len()..].iter().cloned(),
))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you can't use Path::prefix_match for this? https://docs.rs/object_store/latest/object_store/path/struct.Path.html#method.prefix_match

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great callout

let path = Path::parse(url.path())?;
Ok((Box::new(store), path))
}
#[rstest]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have (or is it possible to add) tests at the store_from_url level? These current tests do not actually validate the scheme-specific behavior, so if our assumptions about how the scheme works is wrong then we would still have bugs. i.e are we sure that object_store::arse_url_opts actually behaves the way we expect for azure vs s3 urls?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had set up a small azure container instance with the format that was broken and used delta kernel rs to create a table and list from the table (results of which are in the screenshots in the pr description). It would be difficult to add this test to our unit / integration testing because it relies on a running azure instance, but I could certainly create a mock object store which is able to test store_from_url through more thorough unit testing.

@kyli87
Copy link
Copy Markdown
Collaborator

kyli87 commented Apr 22, 2026

Also the issue linked in your PR description is wrong - I assume it should be this one instead? #2209

@Jameson-Crate
Copy link
Copy Markdown
Collaborator Author

Also the issue linked in your PR description is wrong - I assume it should be this one instead? #2209

good callout, thanks

@Jameson-Crate Jameson-Crate marked this pull request as draft April 24, 2026 16:15
## What changes are proposed in this pull request?

This release has all main branch commits up to Wed Apr 29 17:13:32 2026,
with SHA= 14e289f.
@Jameson-Crate Jameson-Crate force-pushed the stack/azure-https-url branch from ea96e36 to 88e33ca Compare May 4, 2026 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants