feat: respect randomizeFilePrefixes and randomPrefixLength in write_dir #2513

Open
CynicDog wants to merge 1 commit into delta-io:main from CynicDog:feat/respect-randomize-file-prefixes-2357

Conversation

CynicDog commented May 2, 2026

WriteContext::write_dir() now emits a random alphanumeric prefix whenever column mapping is on or delta.randomizeFilePrefixes is set, and the prefix length is driven by delta.randomPrefixLength (default 2). When a random prefix is used on a partitioned table, Hive-style path components are suppressed; partition values are still recorded in add.partitionValues. This matches Delta-Spark's getRandomPrefix / DelayedCommitProtocol behavior so kernel writers produce a compatible on-disk layout.

Closes #2357.

What changes are proposed in this pull request?

WriteContext::write_dir() previously emitted a random 2-char alphanumeric prefix only when column mapping was enabled. This PR makes it also respect the two table properties Delta-Spark uses for the same purpose:

  • delta.randomizeFilePrefixes (bool, default false) — when true, a random prefix is emitted regardless of column mapping mode.
  • delta.randomPrefixLength (positive int, default 2) — controls the prefix length for both the column-mapping path and the new randomize path.

Both properties were already deserialized into TableProperties; the implementation is purely in the write path:

  • Transaction::shared_write_state reads the two properties from effective_table_config.table_properties() and stores resolved values on SharedWriteState.
  • random_alphanumeric_prefix(len) is now parameterized by length.
  • write_dir() uses a single should_prefix decision: a random prefix is used iff column mapping is on or randomizeFilePrefixes is set.
    When a random prefix is used on a partitioned table, Hive-style path components are suppressed; partition values are still recorded in add.partitionValues.
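The decision described above can be sketched roughly as follows. This is a simplified illustration, not the kernel's actual code: `PrefixConfig` is a hypothetical stand-in for the resolved values stored on `SharedWriteState`, paths are plain strings rather than URLs, returning the table root for an unpartitioned, unprefixed write is an assumption, and randomness comes from the standard library's `RandomState` only to keep the example dependency-free.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Hypothetical stand-in for the resolved values kept on `SharedWriteState`.
#[derive(Debug, Clone)]
pub struct PrefixConfig {
    pub column_mapping_enabled: bool,
    /// Resolved from `delta.randomizeFilePrefixes` (default false).
    pub randomize_file_prefixes: bool,
    /// Resolved from `delta.randomPrefixLength` (default 2).
    pub random_prefix_length: usize,
}

impl Default for PrefixConfig {
    fn default() -> Self {
        Self {
            column_mapping_enabled: false,
            randomize_file_prefixes: false,
            random_prefix_length: 2,
        }
    }
}

const ALPHANUMERIC: &[u8] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

/// Returns `len` random ASCII alphanumeric characters.
/// Assumption: each `RandomState` is seeded with fresh random keys, so
/// hashing nothing yields a usable pseudo-random u64. A real
/// implementation would more likely use a proper RNG crate.
pub fn random_alphanumeric_prefix(len: usize) -> String {
    (0..len)
        .map(|_| {
            let r = RandomState::new().build_hasher().finish();
            ALPHANUMERIC[(r % ALPHANUMERIC.len() as u64) as usize] as char
        })
        .collect()
}

/// Picks the directory for a new data file. `hive_path` is the
/// Hive-style component (e.g. "date=2026-05-02") on partitioned tables.
pub fn write_dir(root: &str, hive_path: Option<&str>, cfg: &PrefixConfig) -> String {
    // Single decision: prefix iff column mapping is on or randomize is set.
    let should_prefix = cfg.column_mapping_enabled || cfg.randomize_file_prefixes;
    if should_prefix {
        // The random prefix replaces the Hive-style components.
        format!("{root}/{}", random_alphanumeric_prefix(cfg.random_prefix_length))
    } else if let Some(hive) = hive_path {
        format!("{root}/{hive}")
    } else {
        root.to_string()
    }
}
```

With `randomize_file_prefixes: true` and `random_prefix_length: 4`, a call such as `write_dir("memory:///table", Some("date=2026-05-02"), &cfg)` yields the root followed by a four-character prefix and no `date=` component. The partition value would still need to be recorded separately in `add.partitionValues`.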

How was this change tested?

  • All existing write_context unit tests pass unchanged (make_write_context is now a thin wrapper around a new helper that also accepts the randomize parameters).
  • New rstest matrices added in kernel/src/transaction/write_context.rs:
    • test_write_dir_with_randomize_property — cartesian over cm_mode × randomize × is_partitioned.
    • test_write_dir_random_prefix_length_property — verifies prefix length is honored for lengths 1, 2, and 16 in both the CM-on and randomize=true paths.
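    • test_write_dir_cm_off_randomize_suppresses_hive — pins the CM=None + partitioned + randomizeFilePrefixes=true behavior (the random prefix replaces the Hive path), which is the open question raised in the follow-up comment, so a future flip is deliberate.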
  • test_random_alphanumeric_prefix_format extended to loop over multiple lengths (1, 2, 8, 32).
  • Full pre-push pipeline clean:
    • cargo +nightly fmt
    • cargo clippy --workspace --benches --tests --all-features -- -D warnings
    • cargo doc --workspace --all-features --no-deps
    • cargo nextest run --workspace --all-features (one unrelated pre-existing FFI trybuild failure on main).
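For reference, the cm_mode × randomize × is_partitioned matrix can be enumerated with plain loops. The sketch below is not the rstest code from the PR; `DirKind`, `expected_dir_kind`, and `all_cases` are hypothetical names, and treating the unpartitioned, unprefixed case as the table root is an assumption. It only spells out which layout each of the 12 combinations is expected to produce under the rule "prefix iff column mapping is on or randomize is set".

```rust
/// Column mapping modes defined by the Delta protocol.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CmMode {
    None,
    Id,
    Name,
}

/// Hypothetical classification of the directory a write lands in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DirKind {
    TableRoot,
    HivePartition,
    RandomPrefix,
}

/// Expected layout for one combination of inputs.
pub fn expected_dir_kind(cm: CmMode, randomize: bool, is_partitioned: bool) -> DirKind {
    let should_prefix = cm != CmMode::None || randomize;
    match (should_prefix, is_partitioned) {
        // A random prefix wins regardless of partitioning.
        (true, _) => DirKind::RandomPrefix,
        (false, true) => DirKind::HivePartition,
        (false, false) => DirKind::TableRoot,
    }
}

/// Enumerates the full cartesian product: 3 modes x 2 x 2 = 12 cases.
pub fn all_cases() -> Vec<(CmMode, bool, bool, DirKind)> {
    let mut cases = Vec::new();
    for cm in [CmMode::None, CmMode::Id, CmMode::Name] {
        for randomize in [false, true] {
            for is_partitioned in [false, true] {
                let kind = expected_dir_kind(cm, randomize, is_partitioned);
                cases.push((cm, randomize, is_partitioned, kind));
            }
        }
    }
    cases
}
```

Only two of the twelve combinations avoid a random prefix: column mapping off with randomize off, either partitioned (Hive path) or unpartitioned (table root).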

CynicDog commented May 2, 2026

@scottsand-db wanted to flag one thing I wasn't sure about: for the CM=None + partitioned + randomizeFilePrefixes=true case, I went with the random prefix replacing the Hive path (so <root>/<prefix>/<uuid>.parquet, with partition values still recorded in add.partitionValues).

My read of Spark's DelayedCommitProtocol.getFilename is that it picks one or the other rather than stacking them — but if that's not what you had in mind, happy to flip it to prefix-then-Hive.

The change would be a small tweak in write_dir() plus updating test_write_dir_cm_off_randomize_suppresses_hive.
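To make the two candidate layouts concrete, here is a small illustrative sketch. `PrefixLayout` and `data_file_path` are hypothetical names that exist only for this comparison, and the prefix is passed in rather than generated so the output is deterministic.

```rust
/// The two layouts under discussion for CM=None + partitioned +
/// randomizeFilePrefixes=true. Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrefixLayout {
    /// Current choice in this PR: the prefix replaces the Hive path.
    ReplaceHive,
    /// Alternative: the prefix comes first, then the Hive path.
    PrefixThenHive,
}

/// Builds the full data file path for a given layout.
pub fn data_file_path(
    root: &str,
    prefix: &str,
    hive_path: &str,
    file_name: &str,
    layout: PrefixLayout,
) -> String {
    match layout {
        PrefixLayout::ReplaceHive => format!("{root}/{prefix}/{file_name}"),
        PrefixLayout::PrefixThenHive => {
            format!("{root}/{prefix}/{hive_path}/{file_name}")
        }
    }
}
```

For a prefix `aB` and Hive component `date=2026-05-02`, the first layout gives `<root>/aB/<uuid>.parquet` and the second gives `<root>/aB/date=2026-05-02/<uuid>.parquet`. In both cases the partition values would still be recorded in add.partitionValues.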

nicklan requested review from DrakeLin and scottsand-db May 4, 2026 18:24