RUM-15138: Rebase trace sample rate against RUM session sample rate for correlated cross-product sampling#3342
RUM-15138: Rebase trace sample rate against RUM session sample rate for correlated cross-product sampling#3342
Conversation
This comment has been minimized.
This comment has been minimized.
6d2dcd7 to
de100fe
Compare
7e72bb7 to
5320e34
Compare
…or correlated cross-product sampling DeterministicTraceSampler now applies a cross-product rebasing formula when a span carries a RUM session ID tag: rebasedRate = sessionSampleRate × traceSampleRate / 100. Previously, the raw traceSampleRate was used as the sampling threshold regardless of whether the session ID was the hash key, causing the effective trace rate among RUM-tracked sessions to be incorrect. The session sample rate is propagated via RumContext (new sessionSampleRate field) into the SDK feature context, then written onto spans as a tag by RumContextPropagator alongside the existing RUM tags. DeterministicTraceSampler reads the tag at sampling time and applies the rebasing; spans without a session ID tag continue to use the raw trace rate unchanged. RumSessionScope snapshots sessionSampleRate in renewSession() at the moment the sampling decision is made, ensuring the rate is immutable for the lifetime of that session.
5320e34 to
b6a78e8
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6a78e8689
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
.../dd-sdk-android-trace/src/main/kotlin/com/datadog/android/trace/DeterministicTraceSampler.kt
Outdated
Show resolved
Hide resolved
...android-rum/src/main/kotlin/com/datadog/android/rum/internal/domain/scope/RumSessionScope.kt
Show resolved
Hide resolved
...dd-sdk-android-rum/src/test/kotlin/com/datadog/android/rum/internal/domain/RumContextTest.kt
Outdated
Show resolved
Hide resolved
...dd-sdk-android-rum/src/test/kotlin/com/datadog/android/rum/internal/domain/RumContextTest.kt
Outdated
Show resolved
Hide resolved
...oid-rum/src/test/kotlin/com/datadog/android/rum/internal/domain/scope/RumSessionScopeTest.kt
Outdated
Show resolved
Hide resolved
| */ | ||
| open class DeterministicTraceSampler( | ||
| sampleRateProvider: () -> Float | ||
| open class DeterministicTraceSampler private constructor( |
There was a problem hiding this comment.
❓ 💡 open on this class created some friction — external subclasses silently bypass the sessionSampleRateProvider wiring and can't access getSampleRate(DatadogSpan), both of which are needed for correct rebasing. Worth making this internal or at least final in the next major version. Or do we have a good reason to have it open. Wdyt?
There was a problem hiding this comment.
@satween might be the best person to discuss this with.
There was a problem hiding this comment.
open is the legacy came from the V2 version of the SDK. Please add todo with link to V4 ticket.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4ca59be245
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
.../dd-sdk-android-trace/src/main/kotlin/com/datadog/android/trace/DeterministicTraceSampler.kt
Outdated
Show resolved
Hide resolved
4ca59be to
a7b9b0b
Compare
|
Codex Review: Didn't find any major issues. Keep them coming! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Thanks for your initial review, @ambushwork! While reviewing this comment, I tried to simplify and improve the changes, so it would be nice if you could take a second pass. 🙏 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a7b9b0b93b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
...id-trace/src/main/kotlin/com/datadog/android/trace/ApmNetworkInstrumentationConfiguration.kt
Outdated
Show resolved
Hide resolved
…hing the actual sampling decision applyPriority() and RequestTracingState.sampleRate now use a span-aware effectiveSampleRate() lookup instead of the argument-less getSampleRate(). For spans linked to an active RUM session the rebased rate is reported; for spans without a session ID the raw trace rate is reported, eliminating the mismatch where the backend received a rebased PSR for non-session spans sampled at the raw trace rate. DeterministicTraceSampler gains an internal getSampleRate(DatadogSpan) method that resolves the effective rate per span, with a compatibility fallback to the session_sample_rate span tag for integrations that do not use ApmNetworkInstrumentationConfiguration. RumContextKeys centralises the session_sample_rate string constant across the trace module.
a7b9b0b to
e50d3bf
Compare
|
|
||
| /** @inheritDoc */ | ||
| override fun sample(item: DatadogSpan): Boolean { | ||
| val sampleRate = getSampleRate(item) |
There was a problem hiding this comment.
This looks like exactly the same code as DeterministicSampler.sample. Is it possible not to have the same algorithm implemented twice?
There was a problem hiding this comment.
Good catch!
I would say to do that we need to extract that to a "public" place (or at least accesible for subclasses) for example:
DeterministicSampler.kt
protected fun computeSamplingDecision(sampleRate: Float, id: ULong): Boolean {
return when {
sampleRate >= SAMPLE_ALL_RATE -> true
sampleRate <= 0f -> false
else -> {
val hash = id * SAMPLER_HASHER
val threshold = (MAX_ID.toDouble() * sampleRate / SAMPLE_ALL_RATE).toULong()
hash < threshold
}
}
}
---
DeterministicSampler.kt
override fun sample(item: T): Boolean =
computeSamplingDecision(getSampleRate(), idConverter(item))
---
DeterministicTraceSampler.kt
override fun sample(item: DatadogSpan): Boolean =
computeSamplingDecision(getSampleRate(item), SpanSamplingIdProvider.provideId(item))
The reason the sample override exists at all is to call getSampleRate(item) instead of getSampleRate() — the item-aware version is needed to support the legacy instantiation path where no sessionSampleRateProvider is wired up and the session rate is read from the span tag instead. The parent's sample would silently miss that fallback.
Unless you consider this a blocker, I'd prefer to keep the duplication for now and avoid widening the API surface. In a future major version, if the legacy constructors can be removed, the override could be dropped entirely and the parent's sample would suffice.
Wdyt?
There was a problem hiding this comment.
I'd prefer to keep the duplication for now and avoid widening the API surface
Can you make computeSamplingDecision in your proposed solution internal instead of protected and make it some standalone function not tied to DeterministicSampler or DeterministicTraceSampler? This way you don't have to change the API of DeterministicSampler at all.
There was a problem hiding this comment.
As discussed I've moved the computeSamplingDecision to dd-sdk-android-internal.
…ernal Moves the core hash-and-threshold computation into a shared top-level function `computeSamplingDecision` in `dd-sdk-android-internal`, removing the duplicated implementation from both `DeterministicSampler` and `DeterministicTraceSampler`. Adds unit tests for the extracted function.
| val sessionSampleRateProvider: () -> Float = { | ||
| resolveSessionSampleRate( | ||
| (sdkRef.get() as? InternalSdkCore) | ||
| ?.getFeatureContext(Feature.RUM_FEATURE_NAME, useContextThread = false) |
There was a problem hiding this comment.
getFeatureContext is a dangerous call, because it involves locking. Maybe it is better to use FeatureContextUpdateReceiver?
There was a problem hiding this comment.
Codex mentioned something similar in this comment #3342 (comment) so I added useContextThread = false.
IIUC it still acquires a readLock which is where you see the danger, right? Anyway, let me try your suggestion as it looks safer.
There was a problem hiding this comment.
Suggestion applied!
satween
left a comment
There was a problem hiding this comment.
Two additional things to check before proceed:
-
ApmInstrumentationis an experimental API that we gonna move to fromTracingInterceptor/DatadogInterceptorand that already used inCronet. But untill then - you have to supportTracingInterceptor/DatadogInterceptoras well. -
Note that there is another effective sampling calculation logic that is applied to the telemetry event. Make sure that your changes properly works with that one
| */ | ||
| open class DeterministicTraceSampler( | ||
| sampleRateProvider: () -> Float | ||
| open class DeterministicTraceSampler private constructor( |
There was a problem hiding this comment.
open is the legacy came from the V2 version of the SDK. Please add todo with link to V4 ticket.
|
|
||
| internal fun Sampler<DatadogSpan>.effectiveSampleRate(span: DatadogSpan): Float? { | ||
| return when (this) { | ||
| is DeterministicTraceSampler -> getSampleRate(span) |
There was a problem hiding this comment.
I think this is not the best API design, as this layer becomes aware of the implementation of the Sampler interface.
Maybe we should add getSampleRate(T) method with the default implementation with fallback to getSampleRate()?
this will keep implementation isolated from the caller side + wont break backward compatibility. And in DeterministicTraceSampler you could implement independent behaviour. WDYT?
There was a problem hiding this comment.
This is a good point. I totally agree that this is far from an ideal solution.
I considered that option before implementation but adding getSampleRate(T) on the Sampler interface sounds like an unnecessary refactor considering that and some point we'll be able to get rid of that method. We only need getSampleRate(T) due to the legacy classes but when remove and make those private/internal we won't need that method anymore.
The extension function with the when (this is ...) check is admittedly not clean, but it keeps the coupling contained and co-located with the V4 TODO, rather than leaving a footprint on the interface.
|
|
||
| internal object RumContextKeys { | ||
| // Must match RumContext.SESSION_SAMPLE_RATE in dd-sdk-android-rum. | ||
| internal const val SESSION_SAMPLE_RATE = "session_sample_rate" |
There was a problem hiding this comment.
Please move it to the LogAttributes as they used for the RUM context propagation anyway. We should move them out of the LogAttributes for sure, but it should be made with the major version update. Before that moment, let's reduce files that holds constants for RUM context propagation
| * @param id a stable numerical identifier derived from the item being sampled. | ||
| * @return true if the item should be sampled, false otherwise. | ||
| */ | ||
| fun computeSamplingDecision(sampleRate: Float, id: ULong): Boolean { |
There was a problem hiding this comment.
why did you moved this code out of the DeterministicSampler ? It's seems like only descendants of it are using it...
There was a problem hiding this comment.
You can see the discussion: #3342 (comment)
…pdateReceiver for session sample rate caching
c8f50f9 to
d5d716f
Compare
hamorillo
left a comment
There was a problem hiding this comment.
Thanks for your review @satween. Really appreciate it.
But until then - you have to support TracingInterceptor/DatadogInterceptor as well.
This part should be already covered as they are using com.datadog.android.trace.DeterministicTraceSampler and we've implemented the getSampleRate(item: DatadogSpan): Float to cover that case where we need to extract the RUM session sample rate from the span.
While reviewing your comment is true that I found an issue where we weren't sending the right sample rate in _dd.agent_psr and _dd.rule_psr. Should be fixed here with a similar (not ideal) approach. For more context see: #3342 (comment)
Note that there is another effective sampling calculation logic that is applied to the telemetry event. Make sure that your changes properly works with that one
I would say, that part shouldn't be affected by the changes in this PR.
|
|
||
| internal fun Sampler<DatadogSpan>.effectiveSampleRate(span: DatadogSpan): Float? { | ||
| return when (this) { | ||
| is DeterministicTraceSampler -> getSampleRate(span) |
There was a problem hiding this comment.
This is a good point. I totally agree that this is far from an ideal solution.
I considered that option before implementation but adding getSampleRate(T) on the Sampler interface sounds like an unnecessary refactor considering that and some point we'll be able to get rid of that method. We only need getSampleRate(T) due to the legacy classes but when remove and make those private/internal we won't need that method anymore.
The extension function with the when (this is ...) check is admittedly not clean, but it keeps the coupling contained and co-located with the V4 TODO, rather than leaving a footprint on the interface.
| val sessionSampleRateProvider: () -> Float = { | ||
| resolveSessionSampleRate( | ||
| (sdkRef.get() as? InternalSdkCore) | ||
| ?.getFeatureContext(Feature.RUM_FEATURE_NAME, useContextThread = false) |
There was a problem hiding this comment.
Suggestion applied!
|
|
||
| internal object RumContextKeys { | ||
| // Must match RumContext.SESSION_SAMPLE_RATE in dd-sdk-android-rum. | ||
| internal const val SESSION_SAMPLE_RATE = "session_sample_rate" |
…in legacy interceptors Make DeterministicTraceSampler.getSampleRate(DatadogSpan) public so the legacy TracingInterceptor and DatadogInterceptor paths can call the span-aware overload at the PSR reporting sites. The type-check pattern mirrors the existing effectiveSampleRate approach in ApmNetworkInstrumentationExt. The session_sample_rate tag is already written onto the span by extractRumContext before the PSR metric is set, so the tag-based fallback fires correctly for samplers created via the legacy public constructors. getSampleRate(DatadogSpan) will be removed when the legacy interceptor path is retired in v4.
Replace the internal RumContextKeys object (single-constant file in dd-sdk-android-trace) with LogAttributes.RUM_SESSION_SAMPLE_RATE in dd-sdk-android-core, consistent with where the other RUM context propagation keys (RUM_SESSION_ID, RUM_VIEW_ID, etc.) are defined.
d5d716f to
8f392fd
Compare
What does this PR do?
DeterministicTraceSamplernow applies cross-product rebasing usingsessionSampleRate × traceSampleRate / 100whenever a valid session sample rate is available. This rebasing is provider-driven and no longer gated byrum.session.idon the span.The session sample rate is propagated from
RumSessionScope→RumContext→ SDK feature context.ApmNetworkInstrumentationConfigurationreads it at instrumentation setup time via asessionSampleRateProviderlambda and passes it toDeterministicTraceSampler. A compatibility fallback still readssession_sample_ratefrom span tags for integrations that bypassApmNetworkInstrumentationConfiguration._dd.agent_psrnow reports the same effective sample rate used by sampling decisions in instrumentation paths, including rebased rates when applicable.Motivation
Previously, rebasing behavior and effective-rate reporting were inconsistent across paths. This change makes rebasing policy explicit and consistent: when a session sample rate is available, trace sampling is rebased to preserve the intended cross-product relationship between session and trace sampling.
Additional Notes
Session rate stability:
RumSessionScopesnapshotssessionSampleRateatrenewSession()time rather than callinggetSampleRate()on everygetRumContext()invocation, so session-level metadata remains stable for the full session lifetime.RumContextKeys: A new internal object centralises thesession_sample_ratekey in the trace module.Follow-up
❓ 💡 We should consider, in the next major version, making
DeterministicTraceSamplernon-public so all instances are SDK-controlled. That would let us remove the span-tag compatibility fallback (session_sample_rate) and simplify sampling to a single internal effective-rate source.Review checklist (to be filled by reviewers)