Skip to content

RUM-15138: Rebase trace sample rate against RUM session sample rate for correlated cross-product sampling#3342

Open
hamorillo wants to merge 6 commits intodevelopfrom
hector.morilloprieto/RUM-15138
Open

RUM-15138: Rebase trace sample rate against RUM session sample rate for correlated cross-product sampling#3342
hamorillo wants to merge 6 commits intodevelopfrom
hector.morilloprieto/RUM-15138

Conversation

@hamorillo
Copy link
Copy Markdown
Contributor

@hamorillo hamorillo commented Apr 7, 2026

What does this PR do?

DeterministicTraceSampler now applies cross-product rebasing using sessionSampleRate × traceSampleRate / 100 whenever a valid session sample rate is available. This rebasing is provider-driven and no longer gated by rum.session.id on the span.

The session sample rate is propagated from RumSessionScopeRumContext → SDK feature context. ApmNetworkInstrumentationConfiguration reads it at instrumentation setup time via a sessionSampleRateProvider lambda and passes it to DeterministicTraceSampler. A compatibility fallback still reads session_sample_rate from span tags for integrations that bypass ApmNetworkInstrumentationConfiguration.

_dd.agent_psr now reports the same effective sample rate used by sampling decisions in instrumentation paths, including rebased rates when applicable.

Motivation

Previously, rebasing behavior and effective-rate reporting were inconsistent across paths. This change makes rebasing policy explicit and consistent: when a session sample rate is available, trace sampling is rebased to preserve the intended cross-product relationship between session and trace sampling.

Additional Notes

Session rate stability: RumSessionScope snapshots sessionSampleRate at renewSession() time rather than calling getSampleRate() on every getRumContext() invocation, so session-level metadata remains stable for the full session lifetime.

RumContextKeys: A new internal object centralises the session_sample_rate key in the trace module.

Follow-up

❓ 💡 We should consider, in the next major version, making DeterministicTraceSampler non-public so all instances are SDK-controlled. That would let us remove the span-tag compatibility fallback (session_sample_rate) and simplify sampling to a single internal effective-rate source.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Make sure you discussed the feature or bugfix with the maintaining team in an Issue
  • Make sure each commit and the PR mention the Issue number (cf the CONTRIBUTING doc)

@datadog-prod-us1-5

This comment has been minimized.

@hamorillo hamorillo force-pushed the hector.morilloprieto/RUM-15138 branch from 6d2dcd7 to de100fe Compare April 7, 2026 13:06
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 79.59184% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.88%. Comparing base (1425287) to head (8f392fd).
⚠️ Report is 24 commits behind head on develop.

Files with missing lines Patch % Lines
...id/trace/ApmNetworkInstrumentationConfiguration.kt 80.00% 3 Missing and 1 partial ⚠️
...datadog/android/trace/DeterministicTraceSampler.kt 78.95% 0 Missing and 4 partials ⚠️
...ndroid/trace/internal/ApmNetworkInstrumentation.kt 82.35% 0 Missing and 3 partials ⚠️
...datadog/android/okhttp/trace/TracingInterceptor.kt 40.00% 2 Missing and 1 partial ⚠️
...dog/android/cronet/internal/DatadogCronetEngine.kt 50.00% 0 Missing and 2 partials ⚠️
...n/com/datadog/android/okhttp/DatadogInterceptor.kt 71.43% 1 Missing and 1 partial ⚠️
...android/internal/sampling/DeterministicSampling.kt 83.33% 0 Missing and 1 partial ⚠️
...trace/internal/net/ApmNetworkInstrumentationExt.kt 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3342      +/-   ##
===========================================
+ Coverage    71.51%   71.88%   +0.37%     
===========================================
  Files          946      947       +1     
  Lines        34916    34988      +72     
  Branches      5920     5823      -97     
===========================================
+ Hits         24968    25149     +181     
+ Misses        8280     8237      -43     
+ Partials      1668     1602      -66     
Files with missing lines Coverage Δ
...adog/android/core/sampling/DeterministicSampler.kt 80.95% <100.00%> (-4.23%) ⬇️
.../datadog/android/rum/internal/domain/RumContext.kt 100.00% <100.00%> (ø)
...droid/rum/internal/domain/scope/RumSessionScope.kt 91.77% <100.00%> (+0.80%) ⬆️
...dog/android/trace/internal/RumContextPropagator.kt 77.97% <100.00%> (+4.75%) ⬆️
...android/internal/sampling/DeterministicSampling.kt 83.33% <83.33%> (ø)
...trace/internal/net/ApmNetworkInstrumentationExt.kt 82.76% <75.00%> (+13.53%) ⬆️
...dog/android/cronet/internal/DatadogCronetEngine.kt 80.85% <50.00%> (-0.97%) ⬇️
...n/com/datadog/android/okhttp/DatadogInterceptor.kt 81.72% <71.43%> (+1.16%) ⬆️
...ndroid/trace/internal/ApmNetworkInstrumentation.kt 62.69% <82.35%> (+1.21%) ⬆️
...datadog/android/okhttp/trace/TracingInterceptor.kt 83.92% <40.00%> (+2.11%) ⬆️
... and 2 more

... and 107 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hamorillo hamorillo force-pushed the hector.morilloprieto/RUM-15138 branch 2 times, most recently from 7e72bb7 to 5320e34 Compare April 8, 2026 06:19
…or correlated cross-product sampling

DeterministicTraceSampler now applies a cross-product rebasing formula when a span carries a RUM session ID tag: rebasedRate = sessionSampleRate × traceSampleRate / 100. Previously, the raw traceSampleRate was used as the sampling threshold regardless of whether the session ID was the hash key, causing the effective trace rate among RUM-tracked sessions to be incorrect.

The session sample rate is propagated via RumContext (new sessionSampleRate field) into the SDK feature context, then written onto spans as a tag by RumContextPropagator alongside the existing RUM tags. DeterministicTraceSampler reads the tag at sampling time and applies the rebasing; spans without a session ID tag continue to use the raw trace rate unchanged.

RumSessionScope snapshots sessionSampleRate in renewSession() at the moment the sampling decision is made, ensuring the rate is immutable for the lifetime of that session.
@hamorillo hamorillo force-pushed the hector.morilloprieto/RUM-15138 branch from 5320e34 to b6a78e8 Compare April 8, 2026 09:41
@hamorillo hamorillo marked this pull request as ready for review April 8, 2026 12:23
@hamorillo hamorillo requested review from a team as code owners April 8, 2026 12:23
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6a78e8689

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ambushwork
ambushwork previously approved these changes Apr 8, 2026
@hamorillo hamorillo marked this pull request as draft April 8, 2026 13:17
*/
open class DeterministicTraceSampler(
sampleRateProvider: () -> Float
open class DeterministicTraceSampler private constructor(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ 💡 open on this class created some friction — external subclasses silently bypass the sessionSampleRateProvider wiring and can't access getSampleRate(DatadogSpan), both of which are needed for correct rebasing. Worth making this internal or at least final in the next major version. Or do we have a good reason to have it open. Wdyt?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@satween might be the best person to discuss this with.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open is the legacy came from the V2 version of the SDK. Please add todo with link to V4 ticket.

@hamorillo
Copy link
Copy Markdown
Contributor Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ca59be245

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@hamorillo hamorillo force-pushed the hector.morilloprieto/RUM-15138 branch from 4ca59be to a7b9b0b Compare April 9, 2026 21:38
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Keep them coming!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@hamorillo
Copy link
Copy Markdown
Contributor Author

Thanks for your initial review, @ambushwork! While reviewing this comment, I tried to simplify and improve the changes, so it would be nice if you could take a second pass. 🙏

@hamorillo hamorillo marked this pull request as ready for review April 10, 2026 06:49
@hamorillo hamorillo requested a review from ambushwork April 10, 2026 06:49
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7b9b0b93b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…hing the actual sampling decision

applyPriority() and RequestTracingState.sampleRate now use a span-aware effectiveSampleRate() lookup instead of the argument-less getSampleRate(). For spans linked to an active RUM session the rebased rate is reported; for spans without a session ID the raw trace rate is reported, eliminating the mismatch where the backend received a rebased PSR for non-session spans sampled at the raw trace rate.

DeterministicTraceSampler gains an internal getSampleRate(DatadogSpan) method that resolves the effective rate per span, with a compatibility fallback to the session_sample_rate span tag for integrations that do not use ApmNetworkInstrumentationConfiguration. RumContextKeys centralises the session_sample_rate string constant across the trace module.
@hamorillo hamorillo force-pushed the hector.morilloprieto/RUM-15138 branch from a7b9b0b to e50d3bf Compare April 10, 2026 07:17

/** @inheritDoc */
override fun sample(item: DatadogSpan): Boolean {
val sampleRate = getSampleRate(item)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like exactly the same code as DeterministicSampler.sample. Is it possible not to have the same algorithm implemented twice?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

I would say to do that we need to extract that to a "public" place (or at least accesible for subclasses) for example:

DeterministicSampler.kt

protected fun computeSamplingDecision(sampleRate: Float, id: ULong): Boolean {
    return when {
        sampleRate >= SAMPLE_ALL_RATE -> true
        sampleRate <= 0f -> false
        else -> {
            val hash = id * SAMPLER_HASHER
            val threshold = (MAX_ID.toDouble() * sampleRate / SAMPLE_ALL_RATE).toULong()
            hash < threshold
        }
    }
}

---
DeterministicSampler.kt

override fun sample(item: T): Boolean =
    computeSamplingDecision(getSampleRate(), idConverter(item))
    
---

DeterministicTraceSampler.kt    
override fun sample(item: DatadogSpan): Boolean =
    computeSamplingDecision(getSampleRate(item), SpanSamplingIdProvider.provideId(item))    

The reason the sample override exists at all is to call getSampleRate(item) instead of getSampleRate() — the item-aware version is needed to support the legacy instantiation path where no sessionSampleRateProvider is wired up and the session rate is read from the span tag instead. The parent's sample would silently miss that fallback.

Unless you consider this a blocker, I'd prefer to keep the duplication for now and avoid widening the API surface. In a future major version, if the legacy constructors can be removed, the override could be dropped entirely and the parent's sample would suffice.

Wdyt?

Copy link
Copy Markdown
Contributor

@aleksandr-gringauz aleksandr-gringauz Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep the duplication for now and avoid widening the API surface

Can you make computeSamplingDecision in your proposed solution internal instead of protected and make it some standalone function not tied to DeterministicSampler or DeterministicTraceSampler? This way you don't have to change the API of DeterministicSampler at all.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed I've moved the computeSamplingDecision to dd-sdk-android-internal.

…ernal

Moves the core hash-and-threshold computation into a shared top-level function `computeSamplingDecision` in `dd-sdk-android-internal`, removing the duplicated implementation from both `DeterministicSampler` and `DeterministicTraceSampler`. Adds unit tests for the extracted function.
val sessionSampleRateProvider: () -> Float = {
resolveSessionSampleRate(
(sdkRef.get() as? InternalSdkCore)
?.getFeatureContext(Feature.RUM_FEATURE_NAME, useContextThread = false)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getFeatureContext is a dangerous call, because it involves locking. Maybe it is better to use FeatureContextUpdateReceiver?

Copy link
Copy Markdown
Contributor Author

@hamorillo hamorillo Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex mentioned something similar in this comment #3342 (comment) so I added useContextThread = false.

IIUC it still acquires a readLock which is where you see the danger, right? Anyway, let me try your suggestion as it looks safer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion applied!

Copy link
Copy Markdown
Contributor

@satween satween left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two additional things to check before proceed:

  1. ApmInstrumentation is an experimental API that we gonna move to from TracingInterceptor/DatadogInterceptor and that already used in Cronet. But untill then - you have to support TracingInterceptor/DatadogInterceptor as well.

  2. Note that there is another effective sampling calculation logic that is applied to the telemetry event. Make sure that your changes properly works with that one

*/
open class DeterministicTraceSampler(
sampleRateProvider: () -> Float
open class DeterministicTraceSampler private constructor(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open is the legacy came from the V2 version of the SDK. Please add todo with link to V4 ticket.


internal fun Sampler<DatadogSpan>.effectiveSampleRate(span: DatadogSpan): Float? {
return when (this) {
is DeterministicTraceSampler -> getSampleRate(span)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not the best API design, as this layer becomes aware of the implementation of the Sampler interface.

Maybe we should add getSampleRate(T) method with the default implementation with fallback to getSampleRate()?

this will keep implementation isolated from the caller side + wont break backward compatibility. And in DeterministicTraceSampler you could implement independent behaviour. WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I totally agree that this is far from an ideal solution.

I considered that option before implementation but adding getSampleRate(T) on the Sampler interface sounds like an unnecessary refactor considering that and some point we'll be able to get rid of that method. We only need getSampleRate(T) due to the legacy classes but when remove and make those private/internal we won't need that method anymore.

The extension function with the when (this is ...) check is admittedly not clean, but it keeps the coupling contained and co-located with the V4 TODO, rather than leaving a footprint on the interface.


internal object RumContextKeys {
// Must match RumContext.SESSION_SAMPLE_RATE in dd-sdk-android-rum.
internal const val SESSION_SAMPLE_RATE = "session_sample_rate"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move it to the LogAttributes as they used for the RUM context propagation anyway. We should move them out of the LogAttributes for sure, but it should be made with the major version update. Before that moment, let's reduce files that holds constants for RUM context propagation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done c8f50f9

* @param id a stable numerical identifier derived from the item being sampled.
* @return true if the item should be sampled, false otherwise.
*/
fun computeSamplingDecision(sampleRate: Float, id: ULong): Boolean {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you moved this code out of the DeterministicSampler ? It's seems like only descendants of it are using it...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see the discussion: #3342 (comment)

…pdateReceiver for session sample rate caching
Copy link
Copy Markdown
Contributor Author

@hamorillo hamorillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review @satween. Really appreciate it.

But until then - you have to support TracingInterceptor/DatadogInterceptor as well.

This part should be already covered as they are using com.datadog.android.trace.DeterministicTraceSampler and we've implemented the getSampleRate(item: DatadogSpan): Float to cover that case where we need to extract the RUM session sample rate from the span.

While reviewing your comment is true that I found an issue where we weren't sending the right sample rate in _dd.agent_psr and _dd.rule_psr. Should be fixed here with a similar (not ideal) approach. For more context see: #3342 (comment)

Note that there is another effective sampling calculation logic that is applied to the telemetry event. Make sure that your changes properly works with that one

I would say, that part shouldn't be affected by the changes in this PR.


internal fun Sampler<DatadogSpan>.effectiveSampleRate(span: DatadogSpan): Float? {
return when (this) {
is DeterministicTraceSampler -> getSampleRate(span)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I totally agree that this is far from an ideal solution.

I considered that option before implementation but adding getSampleRate(T) on the Sampler interface sounds like an unnecessary refactor considering that and some point we'll be able to get rid of that method. We only need getSampleRate(T) due to the legacy classes but when remove and make those private/internal we won't need that method anymore.

The extension function with the when (this is ...) check is admittedly not clean, but it keeps the coupling contained and co-located with the V4 TODO, rather than leaving a footprint on the interface.

val sessionSampleRateProvider: () -> Float = {
resolveSessionSampleRate(
(sdkRef.get() as? InternalSdkCore)
?.getFeatureContext(Feature.RUM_FEATURE_NAME, useContextThread = false)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion applied!


internal object RumContextKeys {
// Must match RumContext.SESSION_SAMPLE_RATE in dd-sdk-android-rum.
internal const val SESSION_SAMPLE_RATE = "session_sample_rate"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done c8f50f9

…in legacy interceptors

Make DeterministicTraceSampler.getSampleRate(DatadogSpan) public so the legacy
TracingInterceptor and DatadogInterceptor paths can call the span-aware overload
at the PSR reporting sites. The type-check pattern mirrors the existing
effectiveSampleRate approach in ApmNetworkInstrumentationExt.

The session_sample_rate tag is already written onto the span by extractRumContext
before the PSR metric is set, so the tag-based fallback fires correctly for
samplers created via the legacy public constructors.

getSampleRate(DatadogSpan) will be removed when the legacy interceptor path is
retired in v4.
Replace the internal RumContextKeys object (single-constant file in
dd-sdk-android-trace) with LogAttributes.RUM_SESSION_SAMPLE_RATE in
dd-sdk-android-core, consistent with where the other RUM context
propagation keys (RUM_SESSION_ID, RUM_VIEW_ID, etc.) are defined.
@hamorillo hamorillo force-pushed the hector.morilloprieto/RUM-15138 branch from d5d716f to 8f392fd Compare April 14, 2026 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants