Description
When using Opik's ADK integration (OpikTracer + track_adk_agent_recursive) with Google ADK's ContextCacheConfig enabled, LLM spans are never finalized — they remain at _OPIK_SPAN_STATUS: started with no output or usage data.
The issue only occurs in SSE streaming mode (Cloud Run / production). Locally with InMemorySessionService and no streaming, everything works correctly.
Environment
- opik==1.10.25
- google-adk==1.26.0
- Python 3.12
- Cloud Run (SSE streaming mode)
- Vertex AI (GOOGLE_GENAI_USE_VERTEXAI=True)
Reproduction
Setup
from google.adk.agents import Agent
from google.adk.agents.context_cache_config import ContextCacheConfig
from google.adk.apps import App
from google.adk.models.google_llm import Gemini  # import was missing from the snippet
from opik.integrations.adk import OpikTracer, track_adk_agent_recursive

root_agent = Agent(
    name="my_agent",
    model=Gemini(model="gemini-3-flash-preview"),
    # ... tools, callbacks, etc.
)

# Attach the Opik tracer to the root agent and all of its sub-agents.
tracer = OpikTracer(name="my-agent", project_name="my-project")
track_adk_agent_recursive(root_agent, tracer)

app = App(
    root_agent=root_agent,
    name="app",
    # This breaks Opik LLM spans:
    context_cache_config=ContextCacheConfig(
        min_tokens=2048,
        ttl_seconds=1800,
    ),
)
Steps
- Deploy to Cloud Run with SSE streaming enabled
- Send a message that triggers tool use (so there are 2 LLM calls)
- Check Opik traces: LLM spans show _OPIK_SPAN_STATUS: started, with no output and no usage
Expected
LLM spans should show _OPIK_SPAN_STATUS: ready_for_finalization with output and usage data.
Actual
All LLM spans are stuck at started. Cloud Run logs repeatedly show:
OPIK: No current span found in context for model output update
This warning comes from opik_tracer.py:280-284 where context_storage.top_span_data() returns None.
Root Cause Analysis
Opik uses contextvars.ContextVar for span tracking (OpikContextStorage). The before_model_callback pushes span data via context_storage.add_span_data(), and after_model_callback retrieves it via context_storage.top_span_data().
When ContextCacheConfig is enabled, ADK's GeminiContextCacheManager creates OTel spans inside the LLM generation flow:
# google_llm.py:175
with tracer.start_as_current_span('handle_context_caching') as span:
    cache_manager = GeminiContextCacheManager(self.api_client)
    cache_metadata = await cache_manager.handle_context_caching(llm_request)
And inside the cache manager:
# gemini_context_cache_manager.py:361
with tracer.start_as_current_span("create_cache") as span:
    cached_content = await self.genai_client.aio.caches.create(...)
These OTel start_as_current_span context managers, combined with SSE streaming (async generators + PROGRESSIVE_SSE_STREAMING), cause the ContextVar state set by before_model_callback to be invisible when after_model_callback runs inside the streaming generator's execution context.
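The general failure mode can be illustrated with the standard library alone. This is a minimal sketch, not a reproduction of the ADK code path: `current_span`, `before_model`, and `after_model` are hypothetical stand-ins for Opik's context storage and callbacks, and the task boundary is only one assumed way for the two callbacks to end up in different contexts.

```python
import asyncio
import contextvars

# Hypothetical stand-in for a ContextVar-backed span store.
current_span = contextvars.ContextVar("current_span", default=None)


def before_model() -> None:
    """Record the span in the current context (like a 'before' callback)."""
    current_span.set("llm-span")


def after_model():
    """Read the span back (like an 'after' callback)."""
    return current_span.get()


async def before_in_child_task() -> None:
    before_model()


async def main():
    # Case 1: both callbacks run in the same context, so the span is found.
    before_model()
    same_context = after_model()
    current_span.set(None)  # reset for the second case

    # Case 2: the write happens inside a task, which runs in a copy of the
    # caller's context; the caller never sees that write.
    await asyncio.create_task(before_in_child_task())
    other_context = after_model()
    return same_context, other_context


same_context, other_context = asyncio.run(main())
print(same_context, other_context)
```

In the second case the read returns `None`, which matches the symptom of top_span_data() returning None. Which context switch actually occurs inside ADK's streaming flow is not established by this sketch.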
Evidence
We tested systematically on staging:
| Deploy | Context Caching | LLM Span Status |
|---|---|---|
| Branch without context caching | OFF | ready_for_finalization ✅ |
| Main with context caching | ON | started ❌ |
| Branch with context caching removed | OFF | ready_for_finalization ✅ |
Same opik version (1.10.25) and ADK version (1.26.0) across all deploys.
Locally (no SSE streaming), context caching does NOT break Opik — both LLM spans finalize correctly.
Workaround
Disable context caching:
app = App(
    root_agent=root_agent,
    name="app",
    # context_cache_config=ContextCacheConfig(...)  # disabled
)
Suggested Fix
The contextvars approach for span tracking is fragile with async generators and OTel span context managers. Possible fixes:
- Use span ID tracking instead of contextvars stack: Store spans in a dict keyed by a correlation ID that's passed through the callback arguments, rather than relying on ContextVar stack ordering
- Copy context explicitly: When creating the streaming generator, explicitly copy the current contextvars context so span data is preserved
- Fallback mechanism in after_model_callback: When top_span_data() returns None, try to find the span by model name or other metadata before giving up
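The first two options could look roughly like the sketch below. Everything in it is hypothetical: `SpanRegistry`, the `"inv-1"` correlation ID, and the callback shape are illustration only and are not Opik or ADK APIs. It assumes the callbacks receive some identifier that stays stable across the before/after pair.

```python
import asyncio
import contextvars
from typing import Any, Dict, Optional

# Stand-in for the ContextVar-based span storage.
current_span = contextvars.ContextVar("current_span", default=None)


class SpanRegistry:
    """Hypothetical store: spans keyed by a correlation ID, not by context."""

    def __init__(self) -> None:
        self._spans: Dict[str, Dict[str, Any]] = {}

    def start(self, correlation_id: str, name: str) -> None:
        self._spans[correlation_id] = {"name": name, "status": "started"}

    def finish(self, correlation_id: str, output: Any) -> Optional[Dict[str, Any]]:
        # Look the span up by ID; no dependence on the calling context.
        span = self._spans.pop(correlation_id, None)
        if span is not None:
            span.update(status="ready_for_finalization", output=output)
        return span


async def demo():
    registry = SpanRegistry()

    async def before_model(correlation_id: str) -> contextvars.Context:
        registry.start(correlation_id, "llm_call")
        current_span.set("llm_call")
        # Option 2: snapshot the context that holds the span.
        return contextvars.copy_context()

    # Run the "before" step in its own task, so its ContextVar write
    # does not reach the caller's context.
    snapshot = await asyncio.create_task(before_model("inv-1"))

    lost = current_span.get()                  # read in the caller's context
    restored = snapshot.run(current_span.get)  # read inside the snapshot
    span = registry.finish("inv-1", output="hello")
    return lost, restored, span


lost, restored, span = asyncio.run(demo())
```

The registry lookup succeeds regardless of which context the "after" step runs in, while the snapshot approach only works if the copied context can be carried to wherever the streaming generator is consumed.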
Related
- ADK integration: OpikADKOtelTracer kills all OpenTelemetry spans and re-patches on every request via __setstate__ #5374 (OpikADKOtelTracer replaces global TracerProvider)