[Proposal] Streaming Policy Architecture for Policy Engine #706
-
Hi Renuka, overall this looks good. +1 for option 2. My only concern is the user experience: as a user, I feel we should show relevant policies from the list based on the initial API creation. An API can be a REST or a streaming API, and based on that we should show the relevant policies rather than waiting until deployment time. Thanks!
-
1. Problems with existing Policy Implementation
1.1 — The `Policy` interface
Every current policy exposes:
```go
// v1
type Policy interface {
    Mode() ProcessingMode // called once at startup
    OnRequest(ctx, params) RequestAction
    OnResponse(ctx, params) ResponseAction
}
```
The HTTP request-response lifecycle has distinct phases, and what can physically be mutated at each phase is different:
The v1 interface returns the same broad action type for every phase.
1.2 — Response body mode
| Chain composition | Envoy response body mode |
|---|---|
| ALL response-body policies implement `StreamingResponseBodyPolicy` | `FULL_DUPLEX_STREAMED` |
| ANY response-body policy implements only `ResponseBodyPolicy` | `BUFFERED` (forced) |
**Note:** Since we enforce `OnRequestBody` and `OnResponseBody` to be implemented for streaming policies as well, all policies are compatible with each other and their methods can be executed without issue.
2.2 — Return Action types mirror constraints
Each phase returns a distinct action type. The type system encodes what action is possible at that phase — policy authors cannot attempt mutations Envoy cannot honour.
| Phase | Action type | Capabilities |
|---|---|---|
| Request headers | `HeaderAction` | Header mutations, ImmediateResponse |
| Request body (buffered) | `RequestBodyAction` | Body + header + routing mutations, ImmediateResponse |
| Response headers | `HeaderAction` | Header mutations, ImmediateResponse |
| Response body (buffered) | `ResponseBodyAction` | Body + status mutations, ImmediateResponse |
| Request chunk (streaming) | `RequestChunkAction` | Chunk mutation only |
| Response chunk (streaming) | `ResponseChunkAction` | Chunk mutation only |
Key constraint: `ImmediateResponse` is absent from both streaming chunk action types. Once response headers are committed to the downstream client, injecting a new response mid-stream is physically impossible. Encoding this in the type system prevents an entire class of incorrect policy implementations.
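A minimal Go sketch of how phase-specific action types can encode these constraints. All struct names and fields here are illustrative assumptions, not the engine's actual API:

```go
package main

// ImmediateResponse short-circuits the exchange with a gateway-generated reply.
type ImmediateResponse struct {
	StatusCode int
	Body       []byte
}

// HeaderAction is returned from header phases: headers are not yet committed,
// so an ImmediateResponse is still possible.
type HeaderAction struct {
	SetHeaders    map[string]string
	RemoveHeaders []string
	Immediate     *ImmediateResponse
}

// ResponseChunkAction is returned per streaming chunk. It deliberately has
// no Immediate field: once response headers are on the wire, a new response
// cannot be injected, and the compiler enforces that.
type ResponseChunkAction struct {
	ReplaceChunk []byte
}
```

A policy that tries to short-circuit from a chunk hook simply has no field to put the response in, turning a runtime bug into a compile-time impossibility.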
3. Streaming Detection & Mode Override Strategy
This is an operationally critical part of the Policy Engine. The wrong decision at the wrong time causes either unnecessary latency (buffering a streaming response) or broken Content-Length headers (streaming a buffered response).
3.1 — The point of no return
When Envoy processes response headers, it decides whether to send Content-Length or Transfer-Encoding: chunked to the downstream client before the ext_proc response for ResponseHeaders is processed. This means:
- `BUFFERED → FULL_DUPLEX_STREAMED` upgrade at `ResponseHeaders` processing: ✅ works
- `FULL_DUPLEX_STREAMED → BUFFERED` downgrade at `ResponseHeaders` processing: ❌ broken — Envoy has already decided to use chunked encoding for the client, stripping the Content-Length header. This will cause issues for clients expecting a full JSON response with a Content-Length header.
3.2 — The strategy: default BUFFERED, upgrade only when streaming is confirmed
Processing mode can be set at the Envoy listener level and per route. The default mode sets both `request_body_mode` and `response_body_mode` to `BUFFERED`:
```yaml
typed_per_filter_config:
  envoy.filters.http.ext_proc:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExtProcPerRoute
    overrides:
      processing_mode:
        request_header_mode: SEND
        response_header_mode: SEND
        request_body_mode: BUFFERED
        response_body_mode: BUFFERED
        request_trailer_mode: SKIP
        response_trailer_mode: SKIP
```
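Under this strategy the kernel only ever performs the safe upgrade; a minimal sketch (function and type names are hypothetical) of the decision made when response headers arrive:

```go
package main

import "strings"

type BodyMode int

const (
	Buffered BodyMode = iota
	FullDuplexStreamed
)

// responseBodyMode sketches the upgrade-only decision taken at ResponseHeaders:
// stay BUFFERED unless the upstream response is actually streaming AND every
// policy in the chain can process chunks.
func responseBodyMode(headers map[string]string, chainStreams bool) BodyMode {
	te := strings.ToLower(headers["transfer-encoding"])
	ct := strings.ToLower(headers["content-type"])
	isStreaming := strings.Contains(te, "chunked") || strings.HasPrefix(ct, "text/event-stream")
	if isStreaming && chainStreams {
		return FullDuplexStreamed // safe upgrade
	}
	return Buffered // preserve Content-Length for plain responses
}
```

Because the default is `BUFFERED`, the only transition this function can make is the safe upgrade; the broken downgrade path can never occur.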
```mermaid
sequenceDiagram
    participant E as Envoy
    participant K as Kernel
    participant P as Policy Chain
    E->>K: RequestHeaders
    K->>P: OnRequestHeaders()
    K-->>E: ProcessingResponse\n[ResponseBodyMode = BUFFERED] ← safe default
    E->>K: RequestBody
    K->>P: OnRequestBody()
    K-->>E: ProcessingResponse
    Note over E,K: Upstream response arrives
    E->>K: ResponseHeaders
    K->>K: inspect Transfer-Encoding, Content-Type
    alt isChunked OR isSSE
        K->>K: isStreaming = true
        alt chain.StreamResponseBody = true
            K-->>E: ModeOverride = FULL_DUPLEX_STREAMED ✅
        else chain.StreamResponseBody = false
            K-->>E: ModeOverride = BUFFERED\n(chain needs full body)
        end
    else non-streaming
        K-->>E: ModeOverride = BUFFERED\n(preserve Content-Length)
    end
    E->>K: ResponseBody chunk(s)
    K->>P: OnResponseBodyChunk() or OnResponseBody()
    K-->>E: ProcessingResponse
```
3.3 — Response body routing matrix
```mermaid
flowchart TD
    A[ResponseHeaders received] --> B{isStreaming?}
    B -->|No| C[BUFFERED mode<br/>Content-Length preserved]
    B -->|Yes| D{chain.StreamResponseBody?}
    D -->|No — any policy buffered-only| E[BUFFERED mode<br/>Envoy assembles all SSE chunks]
    D -->|Yes — all policies streaming| F[FULL_DUPLEX_STREAMED mode<br/>chunks arrive per event]
    C --> G[OnResponseBody<br/>Content: complete plain JSON<br/>ImmediateResponse available ✅]
    E --> H[OnResponseBody<br/>Content: aggregated SSE string<br/>ImmediateResponse available ✅]
    F --> I[OnResponseBodyChunk per flush<br/>Content: SSE event or JSON chunk<br/>No ImmediateResponse ❌]
```
4. ChunkBuffering — Policies Control Their Own Flush Boundary
4.1 — The interface
```go
type ChunkBuffering interface {
    NeedsMoreData(accumulated []byte) bool
}
```
EOS handling is the policy engine's responsibility. When the stream ends, the kernel flushes unconditionally and never calls `NeedsMoreData`.
4.2 — Accumulation flow
```mermaid
flowchart TD
    A[Envoy delivers chunk] --> B[accumBuf += chunk]
    B --> C{eos?}
    C -->|Yes| G
    C -->|No| D{any policy<br/>NeedsMoreData<br/>accumBuf?}
    D -->|Yes| E[HOLD<br/>Send empty ack to Envoy<br/>Client receives nothing yet]
    E --> A
    D -->|No| G[FLUSH<br/>Run chain on accumBuf<br/>Send mutated result downstream]
    G --> H[Reset accumBuf]
```
Hold = suppress, not echo. During the hold phase, chunks are held — not forwarded downstream. For mutating policies like PII masking this is essential: you never want partially-unmasked content reaching the client.
4.3 — Built-in strategies (no custom logic needed)
We can provide utility functions for common use cases, such as waiting for a minimum context window or waiting for a specific delimiter. Policy authors can call these from their `NeedsMoreData` implementation to ease implementation.
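A sketch of what such helpers could look like; the constructor names are assumptions, not an existing API:

```go
package main

import "bytes"

// MinBytes holds the flush until at least n bytes have accumulated
// (a minimum context window).
func MinBytes(n int) func(accumulated []byte) bool {
	return func(acc []byte) bool { return len(acc) < n }
}

// UntilDelimiter holds the flush until the delimiter appears, e.g.
// []byte("\n\n") for SSE event boundaries.
func UntilDelimiter(delim []byte) func(accumulated []byte) bool {
	return func(acc []byte) bool { return !bytes.Contains(acc, delim) }
}
```

With helpers like these, a policy's `NeedsMoreData` body can be a single delegation, e.g. `return UntilDelimiter([]byte("\n\n"))(accumulated)`.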
5. The Response JSONPath Problem — Detailed Breakdown
5.1 — Why one path cannot work for both response shapes
```mermaid
sequenceDiagram
    participant C as Client
    participant G as Gateway (ext_proc)
    participant L as LLM (OpenAI)
    rect rgb(220, 240, 255)
        Note over C,L: Non-streaming (stream: false)
        C->>G: POST /chat/completions\n{"stream": false}
        G->>L: forwarded request
        L-->>G: HTTP 200 Content-Length: 842\n{"choices":[{"message":{"content":"..."}}]}
        Note over G: Plain JSON\nPath: choices[0].message.content ✅
        G-->>C: response
    end
    rect rgb(255, 235, 220)
        Note over C,L: Streaming (stream: true)
        C->>G: POST /chat/completions\n{"stream": true}
        G->>L: forwarded request
        L-->>G: HTTP 200 Transfer-Encoding: chunked\nContent-Type: text/event-stream
        L-->>G: data: {"choices":[{"delta":{"content":"Hello"}}]}\n\n
        L-->>G: data: {"choices":[{"delta":{"content":" world"}}]}\n\n
        L-->>G: data: [DONE]\n\n
        Note over G: Aggregated SSE (if BUFFERED chain)\nPath: choices[0].delta.content ✅\nbut body is NOT valid JSON ❌
        G-->>C: response
    end
```
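The two response shapes require different extraction logic. A hedged Go sketch (helper names are hypothetical; for brevity it decodes a fixed field rather than evaluating an arbitrary JSONPath):

```go
package main

import (
	"encoding/json"
	"strings"
)

// chatResponse covers both shapes: message.content for plain JSON,
// delta.content for SSE events.
type chatResponse struct {
	Choices []struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// contentFromJSON handles stream:false, where the body is one complete
// JSON document.
func contentFromJSON(body []byte) string {
	var r chatResponse
	if json.Unmarshal(body, &r) != nil || len(r.Choices) == 0 {
		return ""
	}
	return r.Choices[0].Message.Content
}

// contentFromSSE handles stream:true, where the body is a sequence of
// "data: ..." events and is NOT itself valid JSON.
func contentFromSSE(body string) string {
	var sb strings.Builder
	for _, line := range strings.Split(body, "\n") {
		if !strings.HasPrefix(line, "data: ") {
			continue
		}
		payload := strings.TrimPrefix(line, "data: ")
		if payload == "[DONE]" {
			continue
		}
		var r chatResponse
		if json.Unmarshal([]byte(payload), &r) == nil && len(r.Choices) > 0 {
			sb.WriteString(r.Choices[0].Delta.Content)
		}
	}
	return sb.String()
}
```

Feeding the aggregated SSE body through `contentFromJSON` would simply fail to parse, which is exactly why a single `jsonPath` cannot serve both shapes.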
5.2 — The fix: two explicit path configs in Policy Definition
```yaml
# v1 — one jsonPath, broken for SSE
response:
  jsonPath: "$.choices[0].message.content"   # fails on SSE body, wrong path anyway

# v2 — format-aware, two separate paths
response:
  min: 10
  max: 1000
  jsonPath: "$.choices[0].message.content"           # plain JSON (stream: false)
  streamingJsonPath: "$.choices[0].delta.content"    # per SSE event (stream: true)
```
6. Summary
For policy authors
- Implement only the sub-interfaces for the phases you care about — unused phases cost nothing at runtime
- The action type hierarchy prevents attempting mutations that are physically impossible at a given phase
- `ChunkBuffering` with utility functions means accumulation logic is one line per policy
- All parameter parsing goes in `GetPolicy` — hook methods are pure, allocation-free, and fast
For operators
- One `policy-definition.yaml` per policy, validated by the control plane before the policy code is ever invoked
- `response.jsonPath` (plain JSON) and `response.streamingJsonPath` (per SSE event) explicitly address both LLM response shapes
- Both path fields default to empty with auto-extraction — zero config required for standard OpenAI usage
For the runtime
- Streaming mode is decided at `ResponseHeaders` — after the upstream response type is known — preventing `Content-Length` stripping on non-streaming responses
- Chunk suppression during the hold phase means no partially-processed data (e.g. unmasked PII) ever reaches the client
- Mixed chains (streaming + buffered policies) degrade gracefully to `BUFFERED`, with all policies receiving the full body through their correct hook
Sample implementation:
- Policy Interface: https://github.com/Thushani-Jayasekera/envoy-ext-proc-body-streaming/blob/feature/body-streaming-sse/policy-v2/policy/interface.go
- Actions: https://github.com/Thushani-Jayasekera/envoy-ext-proc-body-streaming/blob/feature/body-streaming-sse/policy-v2/policy/action.go
- Word count guardrail policy - https://github.com/Thushani-Jayasekera/envoy-ext-proc-body-streaming/tree/feature/body-streaming-sse/policy-v2/policies/word-count-guardrail
-
Design Discussion: Policy Interface Options Considered and Rejected
These are the interface designs we evaluated, and the problem we were solving:
Option 1 — Unified Body Method with Phase Flags in Context
Idea
Collapse all processing into two methods. The same method handles header decisions, streaming chunks, and buffered bodies. The phase is communicated through fields on the context object:
```go
type Policy interface {
    OnRequestBody(ctx *RequestBodyContext, params map[string]interface{}) RequestAction
    // ↑ ctx.IsHeader / ctx.IsStreamingBody / ctx.IsBufferingBody
    // ↑ ctx.Headers always available
    OnResponseBody(ctx *ResponseBodyContext, params map[string]interface{}) ResponseAction
    // ↑ ctx.RequestHeaders + ctx.ResponseHeaders always available
    // ↑ ctx.IsHeader / ctx.IsStreamingBody / ctx.IsBufferingBody
}
```
The policy author switches on the phase flag:
```go
func (p *MyPolicy) OnResponseBody(ctx *ResponseBodyContext, ...) ResponseAction {
    if ctx.IsHeader {
        // header-only decision
    } else if ctx.IsStreamingBody {
        // process one chunk
    } else {
        // process full buffered body
    }
}
```
Pros
Cons and Reason for Rejection
The type system cannot enforce phase-specific constraints. The phase check is an invisible runtime contract. A "header-only" policy is indistinguishable from a "body" policy.
Decision: Rejected.
Option 2 — Single Interface, All Six Methods (Monolithic)
Idea
Separate the methods per phase explicitly, keeping all six methods on a single interface:
```go
type Policy interface {
    OnRequestHeaders(ctx *RequestHeaderContext, params map[string]interface{}) HeaderAction
    OnResponseHeaders(ctx *ResponseHeaderContext, params map[string]interface{}) HeaderAction
    OnRequestBody(ctx *RequestBodyContext, params map[string]interface{}) RequestBodyAction
    // full buffered request body; ctx.Headers available
    OnRequestBodyChunk(ctx *RequestStreamContext, chunk *Body, params map[string]interface{}) RequestBodyChunkAction
    // called once per streaming request chunk
    OnResponseBody(ctx *ResponseBodyContext, params map[string]interface{}) ResponseAction
    // full buffered response body; ctx.RequestHeaders + ctx.ResponseHeaders available
    OnResponseBodyChunk(ctx *ResponseStreamContext, chunk *Body, params map[string]interface{}) ResponseBodyChunkAction
    // called once per streaming response chunk
}
```
Pros
Cons and Reason for Rejection
Every policy must implement all six methods. The kernel cannot skip phases it does not need, and a separate method would still be required to identify each policy's mode and capabilities.
Decision: Rejected.
Option 3 — Sub-Interfaces for Header and Body Processing, Same Action Types for Streaming and Buffered Bodies
Idea
Split the four methods across four separate interfaces. A policy implements only the sub-interfaces for the phases it cares about. The kernel inspects the chain at startup to discover capabilities. However, the `ResponseBodyPolicy` sub-interface's `OnResponseBody` method handles both streaming and non-streaming responses, so the same `ResponseBodyAction` type serves both:
```go
type RequestHeaderPolicy interface {
    OnRequestHeaders(ctx *RequestHeaderContext, params map[string]interface{}) HeaderAction
}
type ResponseHeaderPolicy interface {
    OnResponseHeaders(ctx *ResponseHeaderContext, params map[string]interface{}) HeaderAction
}
type RequestBodyPolicy interface {
    OnRequestBody(ctx *RequestBodyContext, params map[string]interface{}) RequestBodyAction
}
type ResponseBodyPolicy interface {
    OnResponseBody(ctx *ResponseBodyContext, params map[string]interface{}) ResponseBodyAction
}
```
Pros
Cons and Reason for Rejection
The action types are still too broad.
Decision: Rejected. Close, but the action type problem makes this insufficient.

-
1. Summary
This RFC proposes adding streaming policy support to the Policy Engine, enabling chunk-by-chunk processing of LLM responses. This allows AI guardrail policies (PII masking, content moderation, etc.) to process streaming responses from LLM providers while preserving the real-time streaming experience for end users.
2. Motivation
2.1. Primary Use Case: Real-Time LLM Streaming Responses
LLM providers (OpenAI, Azure OpenAI, Anthropic, etc.) return responses via SSE (Server-Sent Events) or HTTP chunked transfer encoding. The LLM generates tokens incrementally, and these tokens are streamed to the client in real-time.
Without streaming policy support:
With streaming policy support:
The streaming experience is critical for AI applications:
2.2. Secondary Considerations
3. Current State
The existing `Policy` interface assumes buffered body processing. When `BodyModeBuffer` is set, the kernel waits for the complete body before invoking `OnRequest` or `OnResponse`.
4. Proposed Design Options
This section presents two interface design options. Option 2 (Independent Composable Interfaces) is recommended.
4.1. Option 1: Split by Buffered vs Streaming (with Mode function)
This option uses four interfaces split by processing mode, each with a
`Mode()` function.
4.1.1. Option 1A: Four Core Interfaces
Drawbacks:
- A `Mode()` function (redundant with interface choice)
4.1.2. Option 1B: With Separate HeaderOnlyPolicy Interface
Add dedicated interfaces for header-only processing:
Drawbacks:
- Still requires a `Mode()` function for body policies
- Two ways to handle headers (separate `OnRequestHeaders` vs embedded in streaming)
4.2. Option 2: Independent Composable Interfaces (Recommended)
Each interface is independent - no embedding, no
`Mode()` function. Policies implement whichever combination they need.
4.3. Interface IS the Mode Declaration
No
`Mode()` function is needed. The kernel determines processing mode by checking which interfaces the policy implements:
- `RequestHeaderPolicy` only
- `RequestBodyPolicy` only
- `RequestHeaderPolicy` + `RequestBodyPolicy`
- `RequestHeaderPolicy` + `StreamingRequestBodyPolicy`
- `RequestBodyPolicy` + `StreamingRequestBodyPolicy`
4.4. Policy Composition Examples
- `RequestHeaderPolicy`
- `ResponseHeaderPolicy`
- `RequestHeaderPolicy` + `RequestBodyPolicy`
- `ResponseHeaderPolicy` + `ResponseBodyPolicy`
- `RequestBodyPolicy`
- `ResponseBodyPolicy` + `StreamingResponseBodyPolicy`
- `RequestHeaderPolicy` + `StreamingRequestBodyPolicy`
- `StreamingResponseBodyPolicy`
4.5. Option 2 Benefits
- No `Mode()` function - the interface IS the declaration
4.6. Comparison of Options
- Options 1A/1B: ambiguous method naming (`OnRequest` vs `OnRequestHeaders`)
- Option 2: consistent naming (`On` + `Request`/`Response` + `Headers`/`Body`/`BodyChunk`)
Why Option 2 is recommended:
- No `Mode()` function - the interface IS the declaration
5. Sample Policy Implementations
5.1. ModifyHeaders Policy
Use Case: Add, remove, or modify HTTP headers without touching the body.
Interfaces:
`RequestHeaderPolicy` + `ResponseHeaderPolicy` (header-only, works in any route)
Note
Clean header-only implementation! With the composable interface design, this policy only implements 2 interfaces and 2 methods. No body-related boilerplate needed.
5.2. JsonToXml Policy
Use Case: Transform JSON request/response bodies to XML format.
Interfaces:
`RequestHeaderPolicy` + `RequestBodyPolicy` + `ResponseHeaderPolicy` + `ResponseBodyPolicy` (buffered only)
Why buffered only? JSON parsing requires the complete document structure. You cannot transform `{"name": "Jo` to XML until you have the complete JSON object.
Buffering Behavior: When this policy is in a route, the Router (Envoy) buffers the body. If the payload exceeds the configured limit (e.g., 10MB), the Router returns HTTP 413 Payload Too Large.
5.3. PII Masking Policy (AI Guardrail)
Use Case: Scan request prompts and LLM responses for PII (emails, phone numbers, SSN, etc.) and mask them before they reach the client.
Interfaces:
- `RequestBodyPolicy` (buffered) - needs the complete prompt to validate
- `ResponseBodyPolicy` + `StreamingResponseBodyPolicy` (dual-support)
Why streaming matters here: When a user asks an LLM "What's John's email?", the LLM might respond with "John's email is [email protected]". With streaming:
Why this combination?
- Implements both `ResponseBodyPolicy` and `StreamingResponseBodyPolicy` to work in any route configuration
Dual-support benefit: When paired with JsonToXml (buffered-only), the policy uses `OnResponseBody`. When in a streaming route, it uses `OnResponseBodyChunk`.
5.4. Virus Scan Policy
Use Case: Scan uploaded files for malware using streaming virus scanner (e.g., ClamAV). This is a non-AI use case where streaming is required due to large file sizes.
Interfaces:
`RequestHeaderPolicy` + `StreamingRequestBodyPolicy` + `StreamingResponseBodyPolicy` (streaming only)
Note: This policy differs from AI guardrail policies. AI policies stream because LLMs produce streaming responses (SSE/chunked). Virus scan streams because files can be very large (100MB+) and cannot be buffered in memory.
Why streaming only? Virus scanning large files (100MB+) cannot buffer the entire file in memory. The scanner processes chunks as they arrive, maintaining internal state, and delivers verdict on the final chunk.
6. Route Compatibility Matrix
6.1. Compatibility Rules
- If any policy implements only `RequestBodyPolicy` (not `StreamingRequestBodyPolicy`), the request path uses buffered mode
- If all body-processing policies implement `StreamingRequestBodyPolicy`, the request path uses streaming mode
- Header-only policies (`RequestHeaderPolicy`) are compatible with both modes
The same rules apply to the response path with corresponding response interfaces.
6.1.1. Mode Determination Priority
When determining the processing mode for a route, the following priority rules apply:
1. If any policy is streaming-only (implements only `StreamingRequestBodyPolicy`, not `RequestBodyPolicy`): the route uses streaming mode. If another policy is buffered-only, the combination is incompatible.
2. If any policy is buffered-only (implements only `RequestBodyPolicy`, not `StreamingRequestBodyPolicy`): the route uses buffered mode. If another policy is streaming-only, the combination is incompatible.
3. If all body-processing policies are dual-support (implement both `RequestBodyPolicy` AND `StreamingRequestBodyPolicy`): the route defaults to streaming mode (preferred for real-time UX, especially in AI Gateway scenarios).
4. Header-only policies do not affect mode determination.
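The priority rules above can be sketched as a small Go function. The `caps` summary and `routeMode` name are illustrative assumptions; a real kernel would derive the flags from type assertions on the chain:

```go
package main

// caps summarizes a policy by the body interfaces it implements:
// Buffered for RequestBodyPolicy, Streaming for StreamingRequestBodyPolicy.
type caps struct {
	Buffered  bool
	Streaming bool
}

// routeMode applies the priority rules to the request path.
// ok=false means the combination is incompatible.
func routeMode(policies []caps) (mode string, ok bool) {
	streamingOnly, bufferedOnly := false, false
	for _, p := range policies {
		if !p.Buffered && !p.Streaming {
			continue // header-only: does not affect mode
		}
		streamingOnly = streamingOnly || (p.Streaming && !p.Buffered)
		bufferedOnly = bufferedOnly || (p.Buffered && !p.Streaming)
	}
	switch {
	case streamingOnly && bufferedOnly:
		return "", false // e.g. VirusScan (streaming-only) + JsonToXml (buffered-only)
	case bufferedOnly:
		return "buffered", true
	case streamingOnly:
		return "streaming", true
	default:
		return "streaming", true // all dual-support: prefer real-time UX
	}
}
```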
6.2. Example Combinations
Note
Why not allow VirusScan in buffered mode when payload is small?
Processing mode is determined at configuration time based on interface implementations, not at runtime based on payload size. When PIIMasking requires buffered mode, the kernel calls
`OnRequestBody()` - but VirusScan only implements `OnRequestBodyChunk()`. The kernel has no method to call.
If the VirusScan policy author wants compatibility with buffered routes, they can implement both `RequestBodyPolicy` AND `StreamingRequestBodyPolicy` (dual-support pattern). The incompatibility exists because the policy author chose streaming-only - the Gateway Controller cannot make a streaming-only policy work in buffered mode.
7. Policy Compatibility Validation
7.1. The Problem
Policy compatibility must be validated at design time, not deployment time:
7.2. Architecture: Policy Definition as Source of Truth
The Policy Definition (with its
`supports` metadata) is the source of truth for compatibility validation. Policy Hub is where built-in policies reside, but users can also add custom policies with their own Policy Definitions.
```mermaid
flowchart TB
    PH[Policy Hub<br/>built-in policies]
    CP[Custom Policies<br/>user-defined]
    PH --> PD[Policy Definition<br/>supports metadata]
    CP --> PD
    PD --> PA[Platform API<br/>design-time validation]
    PD --> GC[Gateway Controller<br/>deploy-time validation]
    PD --> PE[Policy Engine<br/>runtime]
```
7.3. Policy Definition
Each policy definition includes
`supports` metadata for compatibility validation.
7.4. Validation at Each Layer
Each layer validates policies against their `supports` declarations.
Important
Gateway Builder Validation: The Gateway Builder performs build-time validation to ensure that the Policy Definition YAML accurately reflects the interfaces implemented by the policy code. For example, if a policy's YAML declares
`supports.response.streaming: true`, the builder verifies that the policy implements `StreamingResponseBodyPolicy`. This prevents mismatches between declared capabilities and actual implementation.
7.5. Platform API Validation Logic
```mermaid
flowchart TD
    A[User adds policy to route] --> B{Get existing policies<br/>on route}
    B --> C{For each policy pair}
    C --> D{Either policy is<br/>header-only?}
    D -->|Yes| E[✅ Compatible<br/>header-only works with any mode]
    D -->|No| F{Both support<br/>buffered?}
    F -->|Yes| G[Compatible via buffered]
    F -->|No| H{Both support<br/>streaming?}
    H -->|Yes| I[Compatible via streaming]
    H -->|No| J[❌ Incompatible]
    E --> K[✅ Allow]
    G --> K
    I --> K
    J --> L[Return error]
```
7.6. Error Response
```json
{
  "error": {
    "code": "INCOMPATIBLE_POLICIES",
    "message": "Cannot add 'virus-scan' to route '/upload'",
    "details": {
      "conflict": "request",
      "existing": { "name": "json-to-xml", "supports": ["buffered"] },
      "adding": { "name": "virus-scan", "supports": ["streaming"] },
      "suggestion": "Remove 'json-to-xml' or choose a buffered-compatible policy"
    }
  }
}
```
8. Kernel Processing Flow
8.1. Buffered Request Path
8.2. Streaming Request Path
8.3. Mixed Mode (Buffered Request + Streaming Response) - AI Gateway Primary Use Case
This is the most common pattern for AI Gateway: buffer the complete prompt, stream the LLM response.
The client sees the response appearing in real-time, just as if there were no gateway in between.
9. Policies That Require Chunk Accumulation
9.1. The Problem
The Semantic Completeness Challenge
LLM streaming responses arrive as a sequence of small chunks, where each chunk typically contains only 1-2 tokens. This creates a fundamental architectural challenge for certain categories of policies.
Why Some Policies Cannot Process Individual Chunks
Consider AI guardrail policies such as content safety, toxicity detection, or hallucination filtering. These policies:
Require semantic completeness - They need complete sentences or phrases to make meaningful decisions. A toxicity detector cannot determine if "I'm going to kill" is harmful without seeing what follows ("...time at the arcade" vs "...you").
Depend on external AI services - These policies often call external LLM providers or specialized ML models for analysis. Making an API call for every 1-2 token chunk is neither practical nor cost-effective.
May transform the payload - Content safety policies might redact, rephrase, or block content entirely. The decision on how to transform depends on analyzing sufficient context.
The Policy Chain Coordination Problem
This creates a critical coordination challenge in the policy execution pipeline:
When Policy A (content safety) determines it needs to buffer chunks to form a complete sentence:
This means a buffering policy effectively pauses the entire downstream pipeline for the chunks it's accumulating. The policy engine needs a mechanism for policies to signal "I need more data before I can make a decision, hold everything downstream."
Contrast with Pattern-Based Policies
This is distinct from simpler pattern-matching scenarios (like PII detection where an email is split across chunks). Pattern matching has predictable, bounded buffering needs. Semantic analysis policies have variable buffering requirements - they need "enough tokens to form a complete thought," which varies based on content.
The Architectural Question
How should the policy engine handle policies that:
9.2. Proposed Solutions Analysis
This section analyzes three proposed solutions for handling chunk accumulation.
9.2.1. Solution A: Pre-Processing Check (NeedsMoreData Interface)
Concept: Add a function
`NeedsMoreData(accumulated []byte) bool` that streaming policies can implement. Before processing each accumulated chunk cycle, the policy engine calls this function on streaming policies that implement it. Processing only begins when all policies return `false` (ready) or EndOfStream is reached.
Why two separate interfaces?
A policy may support both request and response streaming with different accumulation logic for each:
With separate interfaces, the policy can implement different logic for each path.
Note: These are separate interfaces from the streaming body policy interfaces. Streaming policies that need accumulation implement BOTH their streaming interface AND the corresponding accumulator:
- `StreamingRequestBodyPolicy` + `StreamingRequestAccumulator`
- `StreamingResponseBodyPolicy` + `StreamingResponseAccumulator`
Simple streaming policies (logging, metrics, PII regex) only implement the streaming body interface, without any accumulator.
Flow:
Key Semantics:
- `NeedsMoreData` receives RAW upstream bytes (before any policy processing)
- `NeedsMoreData` is only called BEFORE chain processing, never mid-chain
- At EndOfStream, the kernel flushes regardless of the `NeedsMoreData` result
Issue Analysis:
- `NeedsMoreData` receives raw bytes, but `OnResponseBodyChunk` receives A's modified output. B's "ready" decision was based on different content than what it processes.
Remaining Considerations:
Verdict: ✅ Viable solution with clear semantics. Trade-offs are acceptable and inherent to the problem.
9.2.2. Solution B: Static Minimum Byte/Chunk Count
Concept: Each policy declares upfront a static requirement: "I need at least X bytes or Y chunks". Policy Engine buffers until all requirements are met.
Issue Analysis:
Verdict: ❌ Not viable for semantic analysis use cases. May work only for fixed-format protocols (e.g., "buffer until newline").
9.2.3. Solution C: Action-Based Buffering Instruction
Concept: Policy returns an action like
`BufferMore{}` mid-chain, telling the engine to accumulate more chunks before continuing to downstream policies.
Flow:
Clarification: If Policy A also needs semantic context, A should implement
`StreamingAccumulator` and the kernel would wait for both A and B before processing. The flow above assumes A is a simple streaming policy (logging, transform) that intentionally processes each chunk individually.
Issue Analysis:
Flow Clarification:
Why This Adds Complexity:
Verdict: ❌ Mid-chain buffering requires complex state tracking in the kernel. Solution A avoids this by making the buffering decision BEFORE any processing starts - no mid-chain state to track.
9.3. Solution Comparison
Recommendation: Solution A (Pre-Processing Check) is the recommended approach for policies requiring chunk accumulation.
9.4. Solution A - Detailed Design
9.4.1. Interface Definition
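A minimal sketch of the two accumulator interfaces, based on the names used in this section. The per-path method names are assumptions — the proposal names only `NeedsMoreData`, but distinct names are needed if one policy type is to carry different logic for the request and response paths (Go cannot give one type two implementations of the same method name, which also motivates the two separate interfaces):

```go
package main

import "bytes"

// StreamingRequestAccumulator lets a streaming request policy signal that
// it cannot decide yet on the accumulated raw request bytes.
type StreamingRequestAccumulator interface {
	NeedsMoreRequestData(accumulated []byte) bool
}

// StreamingResponseAccumulator is the response-path counterpart.
type StreamingResponseAccumulator interface {
	NeedsMoreResponseData(accumulated []byte) bool
}

// sentenceGate is an illustrative implementor: request chunks flush
// immediately, response chunks are held until a sentence boundary appears.
type sentenceGate struct{}

func (sentenceGate) NeedsMoreRequestData(acc []byte) bool { return false }

func (sentenceGate) NeedsMoreResponseData(acc []byte) bool {
	return !bytes.ContainsAny(acc, ".!?")
}
```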
9.4.2. Kernel Processing Flow
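The Solution A loop — accumulate raw chunks, poll every accumulator, run the chain only when all are ready or the stream ends — can be sketched as follows. The slice-based framing and names are simplifications of the real gRPC-driven kernel:

```go
package main

// handleChunks simulates the kernel: accumulate raw bytes, HOLD (nil ack)
// while any accumulator wants more data, FLUSH through the chain otherwise.
// The last chunk is treated as EndOfStream, which always flushes.
func handleChunks(chunks [][]byte, accumulators []func([]byte) bool, processChain func([]byte) []byte) [][]byte {
	var out [][]byte
	var accum []byte
	for i, c := range chunks {
		accum = append(accum, c...)
		eos := i == len(chunks)-1
		hold := false
		if !eos {
			for _, needsMore := range accumulators {
				if needsMore(accum) { // raw bytes, before any policy runs
					hold = true
					break
				}
			}
		}
		if hold {
			out = append(out, nil) // HOLD: empty ack, nothing reaches the client
			continue
		}
		out = append(out, processChain(accum)) // FLUSH the mutated result
		accum = nil
	}
	return out
}
```

Note that the accumulators see the same raw buffer and are never consulted mid-chain, matching the key semantics listed earlier.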
9.4.3. Example: Content Safety Policy
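A hypothetical content-safety policy combining the streaming chunk hook with the accumulator: it holds chunks until a sentence boundary, so "I'm going to kill" is never judged without its continuation. A real implementation would call an external moderation service; a trivial redaction stands in for it here, and the simplified method signatures are assumptions:

```go
package main

import "bytes"

// ContentSafety implements both the streaming body hook and NeedsMoreData.
type ContentSafety struct{}

// NeedsMoreData holds accumulation until a complete sentence is visible.
func (ContentSafety) NeedsMoreData(accumulated []byte) bool {
	return !bytes.ContainsAny(accumulated, ".!?")
}

// OnResponseBodyChunk runs once the kernel flushes the accumulated sentence.
func (ContentSafety) OnResponseBodyChunk(chunk []byte) []byte {
	// Placeholder for the moderation verdict on the complete sentence.
	return bytes.ReplaceAll(chunk, []byte("forbidden"), []byte("[redacted]"))
}
```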
9.4.4. Interface Relationship Summary
- `StreamingResponseBodyPolicy`
- `StreamingRequestBodyPolicy`
- `StreamingResponseBodyPolicy` + `StreamingResponseAccumulator`
- `StreamingRequestBodyPolicy` + `StreamingRequestAccumulator`
- `RequestBodyPolicy` / `ResponseBodyPolicy`
- `RequestHeaderPolicy` / `ResponseHeaderPolicy`
10. Design Decisions
10.1. Interface Design Approach
Decision: Use independent composable interfaces (Option 2) without a
`Mode()` function.
Rationale: Option 2 provides maximum composability - policies implement only the interfaces they need. No boilerplate methods, no mode declaration redundancy. The interface a policy implements IS its mode declaration.
Alternatives Considered:
10.2. Mode Declaration Mechanism
Decision: The interface a policy implements declares its processing mode. No explicit
`Mode()` function is needed.
Rationale: The kernel determines mode by type assertion.
This eliminates redundancy - if a policy implements `StreamingRequestBodyPolicy`, it inherently declares streaming support.
10.3. Route Processing Mode Determination
Decision: Infer processing mode from the combined interface implementations of all policies in a route.
Rules:
- If any policy implements only `RequestBodyPolicy` (not `StreamingRequestBodyPolicy`), the request path uses buffered mode
- If all body-processing policies implement `StreamingRequestBodyPolicy`, the request path uses streaming mode
- Header-only policies (`RequestHeaderPolicy`) are compatible with both modes
Decision: Header processing and body processing are separate, independent interfaces.
Rationale: A policy may need to:
Independent interfaces allow any combination without forcing unnecessary implementations.
10.5. Dual-Support Pattern for Body Processing
Decision: Policies can implement both
`ResponseBodyPolicy` (buffered) AND `StreamingResponseBodyPolicy` (streaming) to work in either mode.
Rationale: This enables policies like PIIMasking to:
The kernel selects which method to call based on the route's determined mode.
10.6. Policy Compatibility Validation Timing
Decision: Validate policy compatibility at design time (Platform API), not just deployment time.
Rationale: Better user experience - users learn about incompatibilities immediately when configuring, not after attempting deployment. Gateway Controller performs defense-in-depth validation at deploy time.
10.7. Policy Definition as Source of Truth
Decision: The Policy Definition (YAML with
`supports` metadata) is the source of truth for compatibility validation.
Rationale: Policy Definition is available to both Platform API (design-time validation) and Gateway Controller (deploy-time validation). Policy Hub stores built-in policies; users can add custom policies with their own definitions.
10.8. Build-Time Validation by Gateway Builder
Decision: The Gateway Builder performs build-time validation to ensure Policy Definition YAML matches the interfaces implemented by the policy code.
Rationale: Compile-time validation catches mismatches early, before deployment. The builder checks:
- If `supports.request.streaming: true`, the policy must implement `StreamingRequestBodyPolicy`
- If `supports.response.buffered: true`, the policy must implement `ResponseBodyPolicy`
This ensures the Policy Definition accurately reflects the policy's actual capabilities, preventing runtime surprises.
10.9. Chunk Accumulation Approach
Decision: Use Solution A (Pre-Processing Check) with
the `NeedsMoreData` interface for policies requiring chunk accumulation.
Rationale: Solution A makes the buffering decision BEFORE any processing starts, avoiding complex mid-chain state tracking. The kernel accumulates raw bytes and only processes when all accumulator policies signal readiness (or EndOfStream is reached).
Alternatives Considered:
10.10. Accumulator Interface Design
Decision: Use separate optional interfaces (
`StreamingRequestAccumulator`, `StreamingResponseAccumulator`) that streaming policies implement IN ADDITION to their body policy interface.
Rationale:
- Example pairing: `StreamingResponseBodyPolicy` + `StreamingResponseAccumulator`
10.11. NeedsMoreData Receives Raw Bytes
Decision:
`NeedsMoreData(accumulated []byte)` receives raw upstream bytes, not content modified by earlier policies.
Rationale: The accumulation decision happens BEFORE chain processing. All policies make their "ready" decision based on the same raw content. After all policies are ready, chain processing runs once with the accumulated content.
10.12. Policy Instance Lifecycle
Decision: Policy instance lives for the entire request/response lifecycle.
Rationale: The same policy instance handles all chunks in a stream. This allows policies to maintain internal state (e.g., accumulated data, scan sessions, partial pattern matches) without relying solely on context metadata.
10.13. Error Handling Mid-Stream
Decision: Return
an `ImmediateResponse` action; Envoy handles connection termination.
Rationale: When a streaming policy returns `ImmediateResponse`, the stream is terminated. This is consistent with buffered-mode error handling and leverages Envoy's built-in stream management.
10.14. Memory Protection for Accumulation
Decision: Configurable maximum accumulation buffer size with error response when exceeded.
Rationale: Unbounded accumulation could lead to memory exhaustion. When the buffer exceeds the configured limit, the kernel returns HTTP 413 (Payload Too Large) rather than risking OOM conditions.
10.15. Backpressure Handling via Chunk Batching
Decision: When the Policy Engine receives chunks faster than it can process them, it batches multiple chunks together and processes them as a single accumulated chunk in the next processing cycle.
Rationale: Rather than implementing complex backpressure signaling to Envoy, the kernel absorbs temporary processing delays by accumulating incoming chunks. On the next processing cycle, all accumulated chunks are sent through the policy chain together. This:
- Preserves ordering and stream semantics (including `EndOfStream`)
Behavior:
10.16. Header Modification Limitation in Streaming Mode
Decision: Header modifications via
`OnRequestHeaders` or `OnResponseHeaders` are final and cannot be changed based on body content discovered during streaming.
Rationale: In streaming mode, headers are forwarded to the client/upstream before body processing begins. This is a fundamental constraint of HTTP streaming - headers must be sent before the body. Policies that need to modify headers based on body inspection must use buffered mode.
Limitation: If a policy discovers something during body chunk processing that should have affected headers (e.g., determining
`Content-Type` from body inspection), it cannot retroactively modify those headers. Policies requiring such behavior should use buffered mode instead.
This is an accepted architectural limitation inherent to HTTP streaming semantics.
10.17. Dual-Support Mode Selection
Decision: When all body-processing policies in a route implement dual-support (both buffered and streaming interfaces), the kernel defaults to streaming mode.
Rationale: Streaming mode provides better real-time UX, which is the primary use case for AI Gateway. If a route author wants to force buffered mode, they can add a buffered-only policy or configure the route explicitly (future enhancement).
11. TODO / Future Work
The following items are identified for future design and implementation:
11.1. Testing Strategy
11.2. Observability
12. References
- `sdk/gateway/policy/v1alpha/interface.go`