[INC-669] Add tail sampling implemented in OTEL SDK#2164
[INC-669] Add tail sampling implemented in OTEL SDK#2164witoszekdev wants to merge 13 commits intomainfrom
Conversation
|
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2164 +/- ##
==========================================
- Coverage 36.07% 36.07% -0.01%
==========================================
Files 938 938
Lines 60362 60370 +8
Branches 2968 2968
==========================================
Hits 21778 21778
- Misses 38224 38232 +8
Partials 360 360
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Differences Found✅ No packages or licenses were added. SummaryExpand
|
There was a problem hiding this comment.
Pull request overview
This PR implements SDK-level tail sampling for OpenTelemetry in the Saleor Apps monorepo, enabling applications to intelligently sample traces based on runtime characteristics (errors and latency) rather than upfront decisions. The implementation introduces a DeferredSampler that defers sampling decisions to span end, a TailSamplingProcessor that promotes spans meeting specific criteria (errors or slow responses), and custom fetch instrumentation to support recording non-sampled spans. The feature is integrated into the search app as a proof of concept.
Key Changes:
- Added
DeferredSamplerandTailSamplingProcessorclasses with comprehensive test coverage to enable tail sampling decisions based on span characteristics - Implemented
TailSamplingFetchInstrumentationto support recording spans even when not initially sampled, allowing error information to be captured - Updated
createHttpInstrumentationto support root span creation when using deferred sampling - Integrated tail sampling into the search app's OTEL configuration
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
packages/otel/src/tail-sampling-processor.ts |
Core implementation of tail sampling processor that promotes non-sampled spans to sampled based on error status or latency thresholds |
packages/otel/src/tail-sampling-processor.test.ts |
Comprehensive unit tests covering error detection, slow span detection, configuration options, and resource attribute handling |
packages/otel/src/deferred-sampler.ts |
Sampler implementation that defers final sampling decisions to span end while respecting parent sampling decisions |
packages/otel/src/deferred-sampler.test.ts |
Unit tests for deferred sampler covering all sampling decision paths |
packages/otel/src/fetch-instrumentation.ts |
Custom fetch instrumentation extending @vercel/otel to continue recording non-sampled spans for tail sampling support |
packages/otel/src/http-instrumentation-factory.ts |
Updated to conditionally allow root span creation when using deferred sampler |
apps/search/src/instrumentations/otel-node.ts |
Integrated tail sampling into search app by configuring DeferredSampler and TailSamplingProcessor |
packages/otel/package.json |
Added test scripts and dependencies (@vercel/otel, vitest, coverage tools) |
pnpm-lock.yaml |
Updated lock file with new dependencies and removed unused Babel packages |
cspell.config.js |
Added "traceparent" to spell check dictionary |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
| export interface TailSamplingProcessorConfig { | ||
| /** | ||
| * The downstream processor to send sampled spans to. | ||
| * Typically a BatchSpanProcessor. | ||
| */ | ||
| processor: SpanProcessor; | ||
|
|
||
| /** | ||
| * Latency threshold in milliseconds. Spans slower than this are always exported. | ||
| * @default 5000 (5 seconds) | ||
| */ | ||
| slowThresholdMs?: number; | ||
|
|
||
| /** | ||
| * Whether to always export error spans. | ||
| * @default true | ||
| */ | ||
| exportErrors?: boolean; | ||
|
|
||
| /** | ||
| * Whether to always export slow spans. | ||
| * @default true | ||
| */ | ||
| exportSlowSpans?: boolean; | ||
| } |
There was a problem hiding this comment.
The TailSamplingProcessorConfig interface lacks documentation explaining the relationship between the configuration options. It would be helpful to document that when both exportErrors and exportSlowSpans are false, no spans will be promoted (effectively disabling tail sampling). Also document that this processor should be used with DeferredSampler for proper operation.
| "test": "vitest", | ||
| "test:ci": "vitest run --coverage" |
There was a problem hiding this comment.
The package.json defines test scripts but there's no vitest.config.ts file in the packages/otel directory. Based on other apps in the repository (like stripe and np-atobarai), a vitest configuration file is needed to properly set up the test environment, specify test file patterns, and configure test setup files.
| ], | ||
| instrumentations: [createHttpInstrumentation()], | ||
| spanProcessors: [tailSamplingProcessor], | ||
| instrumentations: [createHttpInstrumentation({ usingDeferredSpanProcessor: true })], |
There was a problem hiding this comment.
The parameter name "usingDeferredSpanProcessor" is misleading. It refers to using a DeferredSampler, not a DeferredSpanProcessor. Consider using "usingDeferredSampler" or "allowRootSpans" for clarity.
| instrumentations: [createHttpInstrumentation({ usingDeferredSpanProcessor: true })], | |
| instrumentations: [createHttpInstrumentation({ usingDeferredSampler: true })], |
| ? resolveTemplate(resourceNameTemplate, attrs) | ||
| : removeSearch(req.url); | ||
|
|
||
| const spanName = init?.opentelemetry?.spanName ?? `fetch ${req.method} ${req.url}`; |
There was a problem hiding this comment.
The default span name includes the full URL which could lead to very long span names and potential performance issues or data exposure. The URL might contain sensitive query parameters or be extremely long. Consider using just the method and hostname or path without query parameters for the default span name, similar to how resourceName uses removeSearch().
| const spanName = init?.opentelemetry?.spanName ?? `fetch ${req.method} ${req.url}`; | |
| const spanName = init?.opentelemetry?.spanName ?? `fetch ${req.method} ${removeSearch(req.url)}`; |
|
|
||
| const attrs = { | ||
| [SEMATTRS_HTTP_METHOD]: req.method, | ||
| [SEMATTRS_HTTP_URL]: req.url, |
There was a problem hiding this comment.
The full URL including query parameters is stored in the SEMATTRS_HTTP_URL attribute. This could expose sensitive information like API keys, tokens, or personal data that might be in query parameters. Consider sanitizing the URL to remove query parameters or use a more privacy-conscious approach for the http.url attribute.
| [SEMATTRS_HTTP_URL]: req.url, | |
| [SEMATTRS_HTTP_URL]: removeSearch(req.url), |
| export class TailSamplingFetchInstrumentation extends FetchInstrumentation { | ||
| override instrumentationName = "@saleor/otel/fetch"; | ||
|
|
||
| private tailSamplingConfig: FetchInstrumentationConfig; | ||
| private tailSamplingOriginalFetch: typeof fetch | undefined; | ||
| private tailSamplingTracerProvider: TracerProvider | undefined; | ||
|
|
||
| constructor(config: FetchInstrumentationConfig = {}) { | ||
| super(config); | ||
| this.tailSamplingConfig = config; | ||
| } | ||
|
|
||
| override setTracerProvider(tracerProvider: TracerProvider): void { | ||
| super.setTracerProvider(tracerProvider); | ||
| this.tailSamplingTracerProvider = tracerProvider; | ||
| } | ||
|
|
||
| /** | ||
| * Override enable() to remove the `!isSampled()` check. | ||
| * | ||
| * This is a copy of @vercel/otel's FetchInstrumentation.enable() with one key change: | ||
| * The original check `if (!span.isRecording() || !isSampled(span.spanContext().traceFlags))` | ||
| * is changed to just `if (!span.isRecording())` to support tail sampling. | ||
| */ | ||
| override enable(): void { | ||
| this.disable(); | ||
|
|
||
| const tracerProvider = this.tailSamplingTracerProvider; | ||
|
|
||
| if (!tracerProvider) { | ||
| return; | ||
| } | ||
|
|
||
| const tracer = tracerProvider.getTracer(this.instrumentationName, this.instrumentationVersion); | ||
|
|
||
| const config = this.tailSamplingConfig; | ||
| const ignoreUrls = config.ignoreUrls ?? []; | ||
|
|
||
| const shouldIgnore = (url: URL, init: InternalRequestInit | undefined): boolean => { | ||
| if (init?.opentelemetry?.ignore !== undefined) { | ||
| return init.opentelemetry.ignore; | ||
| } | ||
|
|
||
| if (ignoreUrls.length === 0) { | ||
| return false; | ||
| } | ||
|
|
||
| const urlString = url.toString(); | ||
|
|
||
| return ignoreUrls.some((match) => { | ||
| if (typeof match === "string") { | ||
| if (match === "*") { | ||
| return true; | ||
| } | ||
|
|
||
| return urlString.startsWith(match); | ||
| } | ||
|
|
||
| return match.test(urlString); | ||
| }); | ||
| }; | ||
|
|
||
| const host = process.env.VERCEL_URL || process.env.NEXT_PUBLIC_VERCEL_URL || null; | ||
| const branchHost = | ||
| // eslint-disable-next-line turbo/no-undeclared-env-vars -- Vercel system env var | ||
| process.env.VERCEL_BRANCH_URL || process.env.NEXT_PUBLIC_VERCEL_BRANCH_URL || null; | ||
| const propagateContextUrls = config.propagateContextUrls ?? []; | ||
| const dontPropagateContextUrls = config.dontPropagateContextUrls ?? []; | ||
| const resourceNameTemplate = config.resourceNameTemplate; | ||
| const { attributesFromRequestHeaders, attributesFromResponseHeaders } = config; | ||
|
|
||
| const shouldPropagate = (url: URL, init: InternalRequestInit | undefined): boolean => { | ||
| if (init?.opentelemetry?.propagateContext) { | ||
| return init.opentelemetry.propagateContext; | ||
| } | ||
|
|
||
| const urlString = url.toString(); | ||
|
|
||
| if ( | ||
| dontPropagateContextUrls.length > 0 && | ||
| dontPropagateContextUrls.some((match) => { | ||
| if (typeof match === "string") { | ||
| if (match === "*") { | ||
| return true; | ||
| } | ||
|
|
||
| return urlString.startsWith(match); | ||
| } | ||
|
|
||
| return match.test(urlString); | ||
| }) | ||
| ) { | ||
| return false; | ||
| } | ||
|
|
||
| // Allow same origin. | ||
| if (host && url.protocol === "https:" && (url.host === host || url.host === branchHost)) { | ||
| return true; | ||
| } | ||
|
|
||
| // Allow localhost for testing in dev mode. | ||
| if (!host && url.protocol === "http:" && url.hostname === "localhost") { | ||
| return true; | ||
| } | ||
|
|
||
| return propagateContextUrls.some((match) => { | ||
| if (typeof match === "string") { | ||
| if (match === "*") { | ||
| return true; | ||
| } | ||
|
|
||
| return urlString.startsWith(match); | ||
| } | ||
|
|
||
| return match.test(urlString); | ||
| }); | ||
| }; | ||
|
|
||
| // Disable fetch tracing in Next.js to avoid double instrumentation. | ||
| // eslint-disable-next-line turbo/no-undeclared-env-vars -- Next.js internal env var | ||
| process.env.NEXT_OTEL_FETCH_DISABLED = "1"; | ||
|
|
||
| const originalFetch = globalThis.fetch; | ||
|
|
||
| this.tailSamplingOriginalFetch = originalFetch; | ||
|
|
||
| const doFetch: typeof fetch = async (input, initArg) => { | ||
| const init = initArg as InternalRequestInit | undefined; | ||
|
|
||
| // Passthrough internal Next.js requests. | ||
| if (init?.next?.internal) { | ||
| return originalFetch(input, init); | ||
| } | ||
|
|
||
| const req = new Request(input instanceof Request ? input.clone() : input, init); | ||
| const url = new URL(req.url); | ||
|
|
||
| if (shouldIgnore(url, init)) { | ||
| return originalFetch(input, init); | ||
| } | ||
|
|
||
| const attrs = { | ||
| [SEMATTRS_HTTP_METHOD]: req.method, | ||
| [SEMATTRS_HTTP_URL]: req.url, | ||
| [SEMATTRS_HTTP_HOST]: url.host, | ||
| [SEMATTRS_HTTP_SCHEME]: url.protocol.replace(":", ""), | ||
| [SEMATTRS_NET_PEER_NAME]: url.hostname, | ||
| [SEMATTRS_NET_PEER_PORT]: url.port, | ||
| }; | ||
| const resourceName = resourceNameTemplate | ||
| ? resolveTemplate(resourceNameTemplate, attrs) | ||
| : removeSearch(req.url); | ||
|
|
||
| const spanName = init?.opentelemetry?.spanName ?? `fetch ${req.method} ${req.url}`; | ||
|
|
||
| const parentContext = context.active(); | ||
|
|
||
| const span = tracer.startSpan( | ||
| spanName, | ||
| { | ||
| kind: SpanKind.CLIENT, | ||
| attributes: { | ||
| ...attrs, | ||
| "operation.name": `fetch.${req.method}`, | ||
| "resource.name": resourceName, | ||
| ...init?.opentelemetry?.attributes, | ||
| }, | ||
| }, | ||
| parentContext, | ||
| ); | ||
|
|
||
| /* | ||
| * TAIL SAMPLING CHANGE: Only check isRecording(), NOT isSampled(). | ||
| * | ||
| * Original @vercel/otel check was: | ||
| * if (!span.isRecording() || !isSampled(span.spanContext().traceFlags)) | ||
| * | ||
| * This caused non-sampled spans to be ended early without error info. | ||
| * For tail sampling, we need to continue recording so that errors | ||
| * can be captured and the span can be promoted at end time. | ||
| * | ||
| * We still skip non-recording spans (NOT_RECORD sampling decision). | ||
| */ | ||
| if (!span.isRecording()) { | ||
| span.end(); | ||
|
|
||
| return originalFetch(input, init); | ||
| } | ||
|
|
||
| /* | ||
| * Only propagate context if the span is actually sampled. | ||
| * Non-sampled spans should not propagate trace context to avoid | ||
| * forcing downstream services to sample. | ||
| */ | ||
| if (shouldPropagate(url, init) && isSampled(span.spanContext().traceFlags)) { | ||
| const fetchContext = traceApi.setSpan(parentContext, span); | ||
|
|
||
| propagation.inject(fetchContext, req.headers, HEADERS_SETTER); | ||
| } | ||
|
|
||
| if (attributesFromRequestHeaders) { | ||
| headersToAttributes(span, attributesFromRequestHeaders, req.headers); | ||
| } | ||
|
|
||
| try { | ||
| const startTime = Date.now(); | ||
|
|
||
| /* | ||
| * Remove "content-type" for FormData body because undici regenerates | ||
| * a new multipart separator each time. | ||
| */ | ||
| if (init?.body && init.body instanceof FormData) { | ||
| req.headers.delete("content-type"); | ||
| } | ||
|
|
||
| const res = await originalFetch(input, { | ||
| ...init, | ||
| headers: req.headers, | ||
| }); | ||
| const duration = Date.now() - startTime; | ||
|
|
||
| span.setAttribute(SEMATTRS_HTTP_STATUS_CODE, res.status); | ||
| span.setAttribute("http.response_time", duration); | ||
|
|
||
| if (attributesFromResponseHeaders) { | ||
| headersToAttributes(span, attributesFromResponseHeaders, res.headers); | ||
| } | ||
|
|
||
| if (res.status >= 500) { | ||
| onError(span, `Status: ${res.status} (${res.statusText})`); | ||
| } | ||
|
|
||
| // Flush body, but non-blocking. | ||
| if (res.body) { | ||
| void pipeResponse(res).then( | ||
| (byteLength) => { | ||
| if (span.isRecording()) { | ||
| span.setAttribute(SEMATTRS_HTTP_RESPONSE_CONTENT_LENGTH_UNCOMPRESSED, byteLength); | ||
| span.end(); | ||
| } | ||
| }, | ||
| (err) => { | ||
| if (span.isRecording()) { | ||
| onError(span, err); | ||
| span.end(); | ||
| } | ||
| }, | ||
| ); | ||
| } else { | ||
| span.end(); | ||
| } | ||
|
|
||
| return res; | ||
| } catch (e) { | ||
| onError(span, e); | ||
| span.end(); | ||
| throw e; | ||
| } | ||
| }; | ||
|
|
||
| globalThis.fetch = doFetch; | ||
| } | ||
|
|
||
| override disable(): void { | ||
| if (this.tailSamplingOriginalFetch) { | ||
| globalThis.fetch = this.tailSamplingOriginalFetch; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The TailSamplingFetchInstrumentation class lacks test coverage. Given that it overrides critical enable() logic and handles complex scenarios like context propagation, span recording decisions, and error handling, comprehensive unit tests are needed to ensure the tail sampling behavior works correctly. Consider adding tests for scenarios like: non-sampled spans recording errors, context propagation being skipped for non-sampled spans, and interaction with the TailSamplingProcessor.
| "test": "vitest", | ||
| "test:ci": "vitest run --coverage" |
There was a problem hiding this comment.
The otel package now includes test scripts but is missing a vitest.config.ts file. Other apps in the repository (like apps/np-atobarai, apps/stripe) have vitest configuration files that define test workspaces, setup files, and environment settings. Without this configuration, vitest may not find the test files or run with the correct settings.
Create a vitest.config.ts file similar to other packages in the monorepo, defining the test workspace configuration for unit tests.
| * We replicate @vercel/otel's getResourceAttributes() logic here so that | ||
| * TailSamplingProcessor can add these attributes when promoting spans. |
There was a problem hiding this comment.
do you have to replicate it?
can you
class TailSamplingProcess {
constrcutor(private vercelProcessor: CompositeSpanProcessor ){}
promoteToSampled() {
...
this.vercelPRocessor.something()...
}
}
There was a problem hiding this comment.
Unfortunately, this is not exported method from Vercel SDK :/
We also cannot call it again because this span processor is internal, it runs before we can do anything (because registerOTel wraps any processor we pass to it, and Next.js processors run first)
There was a problem hiding this comment.
new version also doesn't export this: https://github.com/vercel/otel/blob/main/packages/otel/src/processor/composite-span-processor.ts#L150
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| return originalFetch(input, init); | ||
| } | ||
|
|
||
| if (shouldPropagate(url, init) && isSampled(span.spanContext().traceFlags)) { |
There was a problem hiding this comment.
The logic for propagating context only for sampled spans may break distributed tracing. When a non-sampled span makes a fetch request and later gets promoted by TailSamplingProcessor, the downstream service won't have the trace context because it wasn't propagated. This can result in broken trace chains where error spans appear disconnected from their parent traces.
Consider either:
- Always propagating context regardless of sampling (the parent trace decision should be maintained), or
- Document this limitation clearly that tail-sampled spans will have broken distributed traces for outgoing requests
| if (shouldPropagate(url, init) && isSampled(span.spanContext().traceFlags)) { | |
| if (shouldPropagate(url, init)) { |
| @@ -0,0 +1,115 @@ | |||
| /** | |||
| * DataDog resource attributes for @vercel/otel compatibility. | |||
There was a problem hiding this comment.
Typo in comment: "DataDog" should be "Datadog" (one word, capital D only at the start)
| * DataDog resource attributes for @vercel/otel compatibility. | |
| * Datadog resource attributes for @vercel/otel compatibility. |
| * 3. TailSamplingProcessor.onEnd() runs → promotes span to SAMPLED | ||
| * 4. Promoted span goes to BatchSpanProcessor (never back through CompositeSpanProcessor) | ||
| * | ||
| * Result: promoted spans are missing DataDog attributes that normally sampled spans have. |
There was a problem hiding this comment.
Typo in comment: "DataDog attributes" should be "Datadog attributes" (one word, capital D only at the start)
| * Result: promoted spans are missing DataDog attributes that normally sampled spans have. | |
| * Result: promoted spans are missing Datadog attributes that normally sampled spans have. |
| import { ReadableSpan } from "@opentelemetry/sdk-trace-node"; | ||
|
|
||
| /** | ||
| * Compute DataDog-compatible resource attributes for a span. |
There was a problem hiding this comment.
Typo in comment: "DataDog-compatible" should be "Datadog-compatible" (one word, capital D only at the start)
| * Compute DataDog-compatible resource attributes for a span. | |
| * Compute Datadog-compatible resource attributes for a span. |
| * | ||
| * ## When to update | ||
| * | ||
| * If @vercel/otel updates their FetchInstrumentation, review their changes |
There was a problem hiding this comment.
This file is a nearly 400-line copy of @vercel/otel's FetchInstrumentation with minimal modifications (only the isSampled check on line 322). This creates significant maintenance burden as it must be manually synced with upstream changes. Consider alternative approaches like: 1) Contributing a configuration option to @vercel/otel to support tail sampling, 2) Creating a wrapper/proxy pattern that decorates the original instrumentation rather than copying it, or 3) Using monkey-patching to override only the specific behavior. Document the trade-offs of this approach in the file header.
| * If @vercel/otel updates their FetchInstrumentation, review their changes | |
| * If @vercel/otel updates their FetchInstrumentation, review their changes | |
| * | |
| * ## Trade-offs and alternatives considered | |
| * | |
| * This file is a near-verbatim copy of @vercel/otel's FetchInstrumentation, with only minimal | |
| * modifications (notably, the isSampled check). This approach introduces significant maintenance | |
| * burden, as any upstream changes must be manually reviewed and merged here. | |
| * | |
| * Alternatives considered: | |
| * 1. **Contributing upstream**: Ideally, @vercel/otel would support a configuration option for tail sampling. | |
| * This would eliminate the need for a fork. However, this requires upstream acceptance and release, | |
| * which may not be timely or feasible for our needs. | |
| * 2. **Wrapper/proxy pattern**: Creating a wrapper that decorates the original FetchInstrumentation | |
| * could avoid code duplication, but the relevant logic is not easily overridable without changes | |
| * to the upstream implementation. | |
| * 3. **Monkey-patching**: Overriding only the specific behavior via monkey-patching is fragile and | |
| * may break with upstream changes, especially if internal APIs are not stable. | |
| * | |
| * Given these constraints, we chose to copy the implementation and clearly mark our modifications. | |
| * This file should be reviewed and updated whenever @vercel/otel's FetchInstrumentation changes. | |
| * The trade-off is increased maintenance burden in exchange for correct tail sampling support. |
| * ## Why this file exists | ||
| * | ||
| * @vercel/otel's CompositeSpanProcessor adds DataDog-specific attributes | ||
| * (`operation.name`, `resource.name`) in onEnd() - but ONLY for sampled spans. |
There was a problem hiding this comment.
Typo in comment: "DataDog-specific" should be "Datadog-specific" (one word, capital D only at the start)
| enumerable: true, | ||
| }); | ||
|
|
||
| // Build new attributes with promotion marker + Vercel/DataDog resource attributes |
There was a problem hiding this comment.
Typo in comment: "Vercel/DataDog" should be "Vercel/Datadog" (one word, capital D only at the start)
| // Build new attributes with promotion marker + Vercel/DataDog resource attributes | |
| // Build new attributes with promotion marker + Vercel/Datadog resource attributes |
| "@vercel/otel": "catalog:", | ||
| "next": "catalog:", | ||
| "urql": "catalog:" |
There was a problem hiding this comment.
Adding @vercel/otel as a peer dependency means that any package consuming @saleor/apps-otel must install @vercel/otel even if they don't use the tail sampling features. Consider whether tail sampling functionality should be optional or if @vercel/otel should remain only a dev dependency. If TailSamplingFetchInstrumentation is the only export requiring @vercel/otel, you might want to make it an optional peer dependency or split it into a separate package.
| "@vercel/otel": "catalog:", | |
| "next": "catalog:", | |
| "urql": "catalog:" | |
| "next": "catalog:", | |
| "urql": "catalog:" | |
| }, | |
| "optionalDependencies": { | |
| "@vercel/otel": "catalog:" |
| * This allows root spans (requests without traceparent set by Saleor and others) | ||
| * to be sampled if app decides to sample. |
There was a problem hiding this comment.
The comment states "Set to true when using DeferredSampler + TailSamplingProcessor" but doesn't explain WHY this allows root spans to be sampled. Consider expanding the documentation to explain that with requireParentforIncomingSpans=true, incoming requests without a traceparent header are rejected, but with tail sampling we need to create root spans for these requests so TailSamplingProcessor can decide whether to export them based on errors/latency. This would help future maintainers understand the architectural decision.
| * This allows root spans (requests without traceparent set by Saleor and others) | |
| * to be sampled if app decides to sample. | |
| * | |
| * By default, with `requireParentforIncomingSpans=true`, incoming requests without a | |
| * `traceparent` header are rejected and no root span is created. However, when using | |
| * tail-based sampling (e.g., with TailSamplingProcessor), we want to create root spans | |
| * for these requests so that the TailSamplingProcessor can later decide whether to export | |
| * them based on errors, latency, or other criteria. | |
| * | |
| * This flag allows root spans (requests without traceparent set by Saleor and others) | |
| * to be created and sampled if the app decides to sample, enabling tail-based sampling | |
| * to function as intended. |
| if (parentSampled) { | ||
| // Parent decided to sample - we MUST sample too | ||
| return { | ||
| decision: SamplingDecision.RECORD_AND_SAMPLED, | ||
| attributes: { | ||
| [SALEOR_SAMPLING_DECISION_ATTR]: "sampled", | ||
| }, | ||
| }; | ||
| } |
There was a problem hiding this comment.
When a parent span is sampled, this returns RECORD_AND_SAMPLED for the child span, which bypasses TailSamplingProcessor entirely. This means child spans under a sampled parent won't benefit from tail sampling logic (e.g., they won't get the TAIL_SAMPLING_PROMOTED_ATTR attribute or Vercel/Datadog resource attributes added by TailSamplingProcessor). This creates inconsistency where some spans in a trace have these attributes and others don't, which could affect observability and filtering. Consider if TailSamplingProcessor should process all spans, or document this limitation.
| if (parentSampled) { | |
| // Parent decided to sample - we MUST sample too | |
| return { | |
| decision: SamplingDecision.RECORD_AND_SAMPLED, | |
| attributes: { | |
| [SALEOR_SAMPLING_DECISION_ATTR]: "sampled", | |
| }, | |
| }; | |
| } | |
| // Always return RECORD to allow TailSamplingProcessor to make the final export decision, | |
| // even if the parent is sampled. This ensures all spans can be processed for tail sampling | |
| // and receive consistent attributes (e.g., TAIL_SAMPLING_PROMOTED_ATTR). | |
| return { | |
| decision: SamplingDecision.RECORD, | |
| attributes: { | |
| [SALEOR_SAMPLING_DECISION_ATTR]: parentSampled ? "sampled" : (parentSpanContext ? "not_sampled" : "none"), | |
| }, | |
| }; |
82bf650 to
afbf49a
Compare
| // SALEOR MODIFICATION: Added type assertion for stricter TypeScript config | ||
| kind === SpanKind.INTERNAL ? "" : SPAN_KIND_NAME[kind as SpanKind], |
There was a problem hiding this comment.
The type assertion kind as SpanKind is unnecessary and potentially dangerous. The kind variable is already typed as SpanKind from line 60 where it's destructured from span, and span.kind is of type SpanKind. Additionally, the check kind === SpanKind.INTERNAL means that when it's NOT INTERNAL, kind could be any of the other valid SpanKind values (SERVER, CLIENT, PRODUCER, CONSUMER), which are all valid keys in SPAN_KIND_NAME. The type assertion bypasses TypeScript's safety without providing any benefit.
| // SALEOR MODIFICATION: Added type assertion for stricter TypeScript config | |
| kind === SpanKind.INTERNAL ? "" : SPAN_KIND_NAME[kind as SpanKind], | |
| // SALEOR MODIFICATION: Removed unnecessary type assertion for stricter TypeScript config | |
| kind === SpanKind.INTERNAL ? "" : SPAN_KIND_NAME[kind], |
| ], | ||
| instrumentations: [createHttpInstrumentation()], | ||
| spanProcessors: [tailSamplingProcessor], | ||
| instrumentations: [createHttpInstrumentation({ usingDeferredSpanProcessor: true })], |
There was a problem hiding this comment.
The TailSamplingFetchInstrumentation is not being added to the instrumentations array. According to the documentation in fetch-instrumentation.ts, this instrumentation was created specifically to support tail sampling by recording span data for non-sampled spans. Without it, fetch calls won't be properly instrumented for tail sampling. The instrumentations array should include an instance of TailSamplingFetchInstrumentation.
| const batchProcessor = createBatchSpanProcessor({ | ||
| accessToken: process.env.OTEL_ACCESS_TOKEN, | ||
| }); | ||
|
|
||
| const tailSamplingProcessor = new TailSamplingProcessor({ | ||
| processor: batchProcessor, | ||
| slowThresholdMs: 5000, | ||
| exportErrors: true, | ||
| exportSlowSpans: true, |
There was a problem hiding this comment.
The tail sampling configuration values (slowThresholdMs: 5000, exportErrors: true, exportSlowSpans: true) are hardcoded. Consider extracting these to environment variables or a configuration file to allow tuning in different environments (development, staging, production) without code changes.
| const batchProcessor = createBatchSpanProcessor({ | |
| accessToken: process.env.OTEL_ACCESS_TOKEN, | |
| }); | |
| const tailSamplingProcessor = new TailSamplingProcessor({ | |
| processor: batchProcessor, | |
| slowThresholdMs: 5000, | |
| exportErrors: true, | |
| exportSlowSpans: true, | |
| // Tail sampling configuration from environment variables with defaults | |
| const TAIL_SAMPLING_SLOW_THRESHOLD_MS = process.env.TAIL_SAMPLING_SLOW_THRESHOLD_MS | |
| ? parseInt(process.env.TAIL_SAMPLING_SLOW_THRESHOLD_MS, 10) | |
| : 5000; | |
| const TAIL_SAMPLING_EXPORT_ERRORS = process.env.TAIL_SAMPLING_EXPORT_ERRORS | |
| ? process.env.TAIL_SAMPLING_EXPORT_ERRORS.toLowerCase() === "true" | |
| : true; | |
| const TAIL_SAMPLING_EXPORT_SLOW_SPANS = process.env.TAIL_SAMPLING_EXPORT_SLOW_SPANS | |
| ? process.env.TAIL_SAMPLING_EXPORT_SLOW_SPANS.toLowerCase() === "true" | |
| : true; | |
| const batchProcessor = createBatchSpanProcessor({ | |
| accessToken: process.env.OTEL_ACCESS_TOKEN, | |
| }); | |
| const tailSamplingProcessor = new TailSamplingProcessor({ | |
| processor: batchProcessor, | |
| slowThresholdMs: TAIL_SAMPLING_SLOW_THRESHOLD_MS, | |
| exportErrors: TAIL_SAMPLING_EXPORT_ERRORS, | |
| exportSlowSpans: TAIL_SAMPLING_EXPORT_SLOW_SPANS, |
| private isError(span: ReadableSpan): boolean { | ||
| // Check span status | ||
| if (span.status.code === SpanStatusCode.ERROR) { | ||
| return true; | ||
| } | ||
|
|
||
| // Check for recorded exceptions | ||
| if (span.events.some((event) => event.name === "exception")) { | ||
| return true; | ||
| } | ||
|
|
||
| // Check HTTP status code attribute (using semantic convention) | ||
| const httpStatusCode = span.attributes[ATTR_HTTP_RESPONSE_STATUS_CODE]; | ||
|
|
||
| if (typeof httpStatusCode === "number" && httpStatusCode >= 500) { | ||
| return true; | ||
| } | ||
|
|
||
| return false; | ||
| } |
There was a problem hiding this comment.
The error detection logic checks for HTTP 500+ status codes, but HTTP 4xx client errors (like 401, 403, 429) might also indicate important issues worth sampling in some scenarios. Consider whether rate limiting (429) or authentication failures (401, 403) should trigger tail sampling, as these could indicate security issues or service degradation that might be valuable to trace.
| "scripts": { | ||
| "check-types": "tsc", | ||
| "lint": "eslint .", | ||
| "lint:fix": "eslint --fix ." | ||
| "lint:fix": "eslint --fix .", | ||
| "test": "vitest", | ||
| "test:ci": "vitest run --coverage" |
There was a problem hiding this comment.
The test commands reference setup files (pattern seen in other packages), but there is no setup-tests.ts file in the otel package src directory. This file is needed to configure the test environment according to repository conventions.
|
I found out that this current implementation doesn't actually export all spans in our trace when it matches our logic (error or timeout). This means that we would have only a single sparse spans on errors, but not the entire trace. Going to implement different solution to address this. |
Scope of the PR
Related issues
Checklist