test: fix flaky metrics tests caused by tracing callsite cache poisoning#2484
test: fix flaky metrics tests caused by tracing callsite cache poisoning#2484chiinlquah wants to merge 4 commits intodelta-io:mainfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2484 +/- ##
=======================================
Coverage 88.50% 88.50%
=======================================
Files 179 179
Lines 59031 59046 +15
Branches 59031 59046 +15
=======================================
+ Hits 52245 52259 +14
Misses 4785 4785
- Partials 2001 2002 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
I'm not sure I understand. This seems like a pretty blatant upstream bug? Why would it be correct to assume that a thread/process will never have a subscriber merely because it currently lacks one? Or does the API mandate rebuilding the cache whenever new subscribers are installed, and we were just failing to do so before? |
|
per link, when a new Dispatch is created (i.e. a new subscriber becomes active), any previously cached Interest values are re-evaluated for all callsites in the program. The problem here is that we were using set_default which only affects the calling thread which opens to races in a multithreaded env. The practical guidance/fix is straightforward, basically for a multithreaded app, we should always use set_global_default once at startup as the baseline, and reach for set_default only when you have a specific reason to override per-thread or per-scope. |
The metrics tests intermittently fail with counters stuck at 0 because the default engine's `MetricsIterator::Drop` emits via `tracing::span!`, and the tracing callsite interest cache can be poisoned process-globally when a thread without a subscriber is the first to hit the callsite. Introduces `test_utils::ensure_metrics_compatible_global_subscriber` -- a `OnceLock` guarded `set_global_default(Registry::default())` -- and calls it before each per-test thread-local `set_default`. Also documents the gotcha on the public `WithMetricsReporterLayer::with_metrics_reporter_layer` so production users who install via thread-local `set_default` know to do the same thing.
7ab02ee to
97d01f6
Compare
Address review on PR delta-io#2484: - kyli87: add `install_thread_local_metrics_reporter` so test writers no longer have to remember to call `ensure_metrics_compatible_global_subscriber` before installing the thread-local subscriber. Migrate all six existing call sites (kernel internal tests in `parallel_phase`, `scan`, and `snapshot::builder`, plus the integration test `measuring_engine`). The kernel-internal helper is a near-duplicate of the test-utils one because `MetricsReporter`'s trait identity differs across the test_utils <-> kernel path-dep boundary; both helpers wrap the same `ensure_metrics_compatible_global_subscriber` + `set_default` sequence. - nicklan (cache-rebuild fallback): already addressed in 97d01f6. - nicklan (rebuild-cache vs global default): already addressed in 97d01f6 -- the public doc explicitly says rebuild_interest_cache is not a reliable substitute on its own.
What changes are proposed in this pull request?
Fixes intermittent failures in the metrics integration and unit tests where counters that should be non-zero (e.g.
storage_read_calls,parquet_read_calls,json_read_calls) instead read 0, causing assertions likeassert_eq!(reporter.storage_read_calls.get(), 2)to panic withleft: 0, right: 2.How was this PR tested?
test_utils::install_thread_local_metrics_reporteris now called by every metrics-test setup site.--test-threads = 1, 2, 4, 8, 16, 32, 64and made sure that previous flaky test passes 100% now.Detailed root cause
Symptom
Counters underlying metric callsites that emit from
Dropimpls (MetricsIterator::Dropin the default engine,emit_json_read_completed/emit_parquet_read_completed/emit_scan_metadata_completedinmetrics::reporter) intermittently read zero in tests, even though the underlying I/O happened and other counters reported by#[instrument]annotations on the same code path (e.g.commit_files,log_segment_loads,snapshot_completions) are correct.Root cause:
tracing-coreRebuilder::JustOneconsults the wrong dispatcherThe actual bug is caused by a tracing-core implementation detail that interacts badly with thread-local
set_default.Today,
tracingcaches per-callsiteInterestprocess-globally. The cache is computed when a callsite is first registered (its first hit) by a function calledrebuild_callsite_interest, which iterates all currently-registered dispatchers and theirregister_callsiteresults.For a process/thread with no subsriber,
register_callsitewill get directed toNoSubscriber, which returnsInterest::never(), and the cache for that callsite is locked toneverfor the rest of the process. Every subsequent emission of that source location -- on any thread, including threads with a real subscriber installed -- short-circuits to a disabled span at the cache check, never consulting the dispatcher.The fix
Install a process-global bare
Registryas the default subscriber once per test binary, viaset_global_default(Registry::default()).This makes
dispatcher::get_default(...)on every thread (including builder-threads, tokio workers, libtest harness scaffolding) return a real subscriber instead ofNoSubscriberand is the practical guidance for a multithreaded program.