test: fix flaky metrics tests caused by tracing callsite cache poisoning by chiinlquah · Pull Request #2484 · delta-io/delta-kernel-rs

chiinlquah · 2026-04-28T06:04:52Z

What changes are proposed in this pull request?

Fixes intermittent failures in the metrics integration and unit tests where counters that should be non-zero (e.g. storage_read_calls, parquet_read_calls, json_read_calls) instead read 0, causing assertions like assert_eq!(reporter.storage_read_calls.get(), 2) to panic with left: 0, right: 2.

How was this PR tested?

New helper test_utils::install_thread_local_metrics_reporter is now called by every metrics-test setup site.
Tested by running the metrics integration test binary in tight loops at --test-threads = 1, 2, 4, 8, 16, 32, 64 and made sure that previous flaky test passes 100% now.

Detailed root cause

Symptom

Counters underlying metric callsites that emit from Drop impls (MetricsIterator::Drop in the default engine, emit_json_read_completed / emit_parquet_read_completed / emit_scan_metadata_completed in metrics::reporter) intermittently read zero in tests, even though the underlying I/O happened and other counters reported by #[instrument] annotations on the same code path (e.g. commit_files, log_segment_loads, snapshot_completions) are correct.

Root cause: `tracing-core` `Rebuilder::JustOne` consults the wrong dispatcher

The actual bug is caused by a tracing-core implementation detail that interacts badly with thread-local set_default.

Today, tracing caches per-callsite Interest process-globally. The cache is computed when a callsite is first registered (its first hit) by a function called rebuild_callsite_interest, which iterates all currently-registered dispatchers and their register_callsite results.

For a process/thread with no subsriber, register_callsite will get directed to NoSubscriber, which returns Interest::never(), and the cache for that callsite is locked to never for the rest of the process. Every subsequent emission of that source location -- on any thread, including threads with a real subscriber installed -- short-circuits to a disabled span at the cache check, never consulting the dispatcher.

The fix

Install a process-global bare Registry as the default subscriber once per test binary, via set_global_default(Registry::default()).

This makes dispatcher::get_default(...) on every thread (including builder-threads, tokio workers, libtest harness scaffolding) return a real subscriber instead of NoSubscriber and is the practical guidance for a multithreaded program.

codecov · 2026-04-28T06:09:20Z

Codecov Report

❌ Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.50%. Comparing base (78670eb) to head (e2ab216).

Files with missing lines	Patch %	Lines
test-utils/src/counting_reporter.rs	92.30%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2484   +/-   ##
=======================================
  Coverage   88.50%   88.50%           
=======================================
  Files         179      179           
  Lines       59031    59046   +15     
  Branches    59031    59046   +15     
=======================================
+ Hits        52245    52259   +14     
  Misses       4785     4785           
- Partials     2001     2002    +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

scovich · 2026-04-29T13:56:09Z

For a process/thread with no subsriber, register_callsite will get directed to NoSubscriber, which returns Interest::never(), and the cache for that callsite is locked to never for the rest of the process. Every subsequent emission of that source location -- on any thread, including threads with a real subscriber installed -- short-circuits to a disabled span at the cache check, never consulting the dispatcher.

I'm not sure I understand. This seems like a pretty blatant upstream bug? Why would it be correct to assume that a thread/process will never have a subscriber merely because it currently lacks one? Or does the API mandate rebuilding the cache whenever new subscribers are installed, and we were just failing to do so before?

chiinlquah · 2026-04-29T18:41:57Z

per link, when a new Dispatch is created (i.e. a new subscriber becomes active), any previously cached Interest values are re-evaluated for all callsites in the program.

The problem here is that we were using set_default which only affects the calling thread which opens to races in a multithreaded env. The practical guidance/fix is straightforward, basically for a multithreaded app, we should always use set_global_default once at startup as the baseline, and reach for set_default only when you have a specific reason to override per-thread or per-scope.

The metrics tests intermittently fail with counters stuck at 0 because the default engine's `MetricsIterator::Drop` emits via `tracing::span!`, and the tracing callsite interest cache can be poisoned process-globally when a thread without a subscriber is the first to hit the callsite. Introduces `test_utils::ensure_metrics_compatible_global_subscriber` -- a `OnceLock` guarded `set_global_default(Registry::default())` -- and calls it before each per-test thread-local `set_default`. Also documents the gotcha on the public `WithMetricsReporterLayer::with_metrics_reporter_layer` so production users who install via thread-local `set_default` know to do the same thing.

Address review on PR delta-io#2484: - kyli87: add `install_thread_local_metrics_reporter` so test writers no longer have to remember to call `ensure_metrics_compatible_global_subscriber` before installing the thread-local subscriber. Migrate all six existing call sites (kernel internal tests in `parallel_phase`, `scan`, and `snapshot::builder`, plus the integration test `measuring_engine`). The kernel-internal helper is a near-duplicate of the test-utils one because `MetricsReporter`'s trait identity differs across the test_utils <-> kernel path-dep boundary; both helpers wrap the same `ensure_metrics_compatible_global_subscriber` + `set_default` sequence. - nicklan (cache-rebuild fallback): already addressed in 97d01f6. - nicklan (rebuild-cache vs global default): already addressed in 97d01f6 -- the public doc explicitly says rebuild_interest_cache is not a reliable substitute on its own.

github-actions Bot assigned chiinlquah Apr 28, 2026

chiinlquah requested a review from nicklan April 28, 2026 22:52

nicklan reviewed Apr 29, 2026

View reviewed changes

Comment thread test-utils/src/counting_reporter.rs Outdated

Comment thread kernel/src/metrics/mod.rs Outdated

chiinlquah added 2 commits April 29, 2026 23:49

refactor: tighten callsite cache rebuild and refine guidance

97d01f6

chiinlquah force-pushed the chiin/fix-metrics-flake-tracing branch from 7ab02ee to 97d01f6 Compare April 29, 2026 23:49

chiinlquah requested review from nicklan and thinh2 April 29, 2026 23:55

chiinlquah enabled auto-merge April 29, 2026 23:55

kyli87 reviewed Apr 30, 2026

View reviewed changes

Comment thread test-utils/src/counting_reporter.rs

chiinlquah added 2 commits May 4, 2026 17:28

Merge branch 'main' into chiin/fix-metrics-flake-tracing

e2ab216

chiinlquah requested a review from kyli87 May 4, 2026 18:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: fix flaky metrics tests caused by tracing callsite cache poisoning#2484

test: fix flaky metrics tests caused by tracing callsite cache poisoning#2484
chiinlquah wants to merge 4 commits intodelta-io:mainfrom
chiinlquah:chiin/fix-metrics-flake-tracing

chiinlquah commented Apr 28, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Apr 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

scovich commented Apr 29, 2026

Uh oh!

chiinlquah commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

chiinlquah commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes are proposed in this pull request?

How was this PR tested?

Detailed root cause

Symptom

Root cause: tracing-core Rebuilder::JustOne consults the wrong dispatcher

The fix

Uh oh!

codecov Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

scovich commented Apr 29, 2026

Uh oh!

chiinlquah commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

chiinlquah commented Apr 28, 2026 •

edited

Loading

Root cause: `tracing-core` `Rebuilder::JustOne` consults the wrong dispatcher

codecov Bot commented Apr 28, 2026 •

edited

Loading