Skip to content

test: fix flaky metrics tests caused by tracing callsite cache poisoning#2484

Open
chiinlquah wants to merge 4 commits intodelta-io:mainfrom
chiinlquah:chiin/fix-metrics-flake-tracing
Open

test: fix flaky metrics tests caused by tracing callsite cache poisoning#2484
chiinlquah wants to merge 4 commits intodelta-io:mainfrom
chiinlquah:chiin/fix-metrics-flake-tracing

Conversation

@chiinlquah
Copy link
Copy Markdown
Collaborator

@chiinlquah chiinlquah commented Apr 28, 2026

What changes are proposed in this pull request?

Fixes intermittent failures in the metrics integration and unit tests where counters that should be non-zero (e.g. storage_read_calls, parquet_read_calls, json_read_calls) instead read 0, causing assertions like assert_eq!(reporter.storage_read_calls.get(), 2) to panic with left: 0, right: 2.

How was this PR tested?

  • New helper test_utils::install_thread_local_metrics_reporter is now called by every metrics-test setup site.
  • Tested by running the metrics integration test binary in tight loops at --test-threads = 1, 2, 4, 8, 16, 32, 64 and made sure that previous flaky test passes 100% now.

Detailed root cause

Symptom

Counters underlying metric callsites that emit from Drop impls (MetricsIterator::Drop in the default engine, emit_json_read_completed / emit_parquet_read_completed / emit_scan_metadata_completed in metrics::reporter) intermittently read zero in tests, even though the underlying I/O happened and other counters reported by #[instrument] annotations on the same code path (e.g. commit_files, log_segment_loads, snapshot_completions) are correct.

Root cause: tracing-core Rebuilder::JustOne consults the wrong dispatcher

The actual bug is caused by a tracing-core implementation detail that interacts badly with thread-local set_default.

Today, tracing caches per-callsite Interest process-globally. The cache is computed when a callsite is first registered (its first hit) by a function called rebuild_callsite_interest, which iterates all currently-registered dispatchers and their register_callsite results.

For a process/thread with no subsriber, register_callsite will get directed to NoSubscriber, which returns Interest::never(), and the cache for that callsite is locked to never for the rest of the process. Every subsequent emission of that source location -- on any thread, including threads with a real subscriber installed -- short-circuits to a disabled span at the cache check, never consulting the dispatcher.

The fix

Install a process-global bare Registry as the default subscriber once per test binary, via set_global_default(Registry::default()).

This makes dispatcher::get_default(...) on every thread (including builder-threads, tokio workers, libtest harness scaffolding) return a real subscriber instead of NoSubscriber and is the practical guidance for a multithreaded program.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.50%. Comparing base (78670eb) to head (e2ab216).

Files with missing lines Patch % Lines
test-utils/src/counting_reporter.rs 92.30% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2484   +/-   ##
=======================================
  Coverage   88.50%   88.50%           
=======================================
  Files         179      179           
  Lines       59031    59046   +15     
  Branches    59031    59046   +15     
=======================================
+ Hits        52245    52259   +14     
  Misses       4785     4785           
- Partials     2001     2002    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chiinlquah chiinlquah requested a review from nicklan April 28, 2026 22:52
Comment thread test-utils/src/counting_reporter.rs Outdated
Comment thread kernel/src/metrics/mod.rs Outdated
@scovich
Copy link
Copy Markdown
Collaborator

scovich commented Apr 29, 2026

For a process/thread with no subsriber, register_callsite will get directed to NoSubscriber, which returns Interest::never(), and the cache for that callsite is locked to never for the rest of the process. Every subsequent emission of that source location -- on any thread, including threads with a real subscriber installed -- short-circuits to a disabled span at the cache check, never consulting the dispatcher.

I'm not sure I understand. This seems like a pretty blatant upstream bug? Why would it be correct to assume that a thread/process will never have a subscriber merely because it currently lacks one? Or does the API mandate rebuilding the cache whenever new subscribers are installed, and we were just failing to do so before?

@chiinlquah
Copy link
Copy Markdown
Collaborator Author

per link, when a new Dispatch is created (i.e. a new subscriber becomes active), any previously cached Interest values are re-evaluated for all callsites in the program.

The problem here is that we were using set_default which only affects the calling thread which opens to races in a multithreaded env. The practical guidance/fix is straightforward, basically for a multithreaded app, we should always use set_global_default once at startup as the baseline, and reach for set_default only when you have a specific reason to override per-thread or per-scope.

The metrics tests intermittently fail with counters stuck at 0 because the
default engine's `MetricsIterator::Drop` emits via `tracing::span!`, and the
tracing callsite interest cache can be poisoned process-globally when a thread
without a subscriber is the first to hit the callsite. Introduces
`test_utils::ensure_metrics_compatible_global_subscriber` -- a `OnceLock`
guarded `set_global_default(Registry::default())` -- and calls it before each
per-test thread-local `set_default`. Also documents the gotcha on the public
`WithMetricsReporterLayer::with_metrics_reporter_layer` so production users
who install via thread-local `set_default` know to do the same thing.
@chiinlquah chiinlquah force-pushed the chiin/fix-metrics-flake-tracing branch from 7ab02ee to 97d01f6 Compare April 29, 2026 23:49
@chiinlquah chiinlquah requested review from nicklan and thinh2 April 29, 2026 23:55
@chiinlquah chiinlquah enabled auto-merge April 29, 2026 23:55
Comment thread test-utils/src/counting_reporter.rs
chiinlquah added 2 commits May 4, 2026 17:28
Address review on PR delta-io#2484:

- kyli87: add `install_thread_local_metrics_reporter` so test writers no
  longer have to remember to call `ensure_metrics_compatible_global_subscriber`
  before installing the thread-local subscriber. Migrate all six existing
  call sites (kernel internal tests in `parallel_phase`, `scan`, and
  `snapshot::builder`, plus the integration test `measuring_engine`).

  The kernel-internal helper is a near-duplicate of the test-utils one
  because `MetricsReporter`'s trait identity differs across the
  test_utils <-> kernel path-dep boundary; both helpers wrap the same
  `ensure_metrics_compatible_global_subscriber` + `set_default` sequence.

- nicklan (cache-rebuild fallback): already addressed in 97d01f6.
- nicklan (rebuild-cache vs global default): already addressed in
  97d01f6 -- the public doc explicitly says rebuild_interest_cache is
  not a reliable substitute on its own.
@chiinlquah chiinlquah requested a review from kyli87 May 4, 2026 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants