Memlayer supports streaming responses from all providers (OpenAI, Claude, Gemini, Ollama). This guide explains how streaming works, its performance characteristics, and best practices.
Streaming mode yields response chunks as the LLM generates them, rather than waiting for the complete response. This lowers perceived latency and makes for a more responsive user experience.
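Because chunks arrive incrementally, a common pattern is to display each chunk as it arrives while also accumulating the full response for later use (logging, follow-up turns, etc.). A minimal standalone sketch — a stub generator stands in for a real streaming client so it runs on its own:

```python
def fake_stream():
    # Stand-in for a client call with stream=True,
    # which yields text chunks one at a time
    yield from ["Machine ", "learning ", "is ", "fun."]

# Print chunks immediately while collecting them
chunks = []
for chunk in fake_stream():
    print(chunk, end="", flush=True)
    chunks.append(chunk)

full_response = "".join(chunks)
print()  # Newline after the streamed output
```

The same loop body works against any iterator of text chunks, so swapping the stub for a real streaming call leaves the pattern unchanged.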
```python
from memlayer.wrappers.openai import OpenAI

client = OpenAI(model="gpt-4.1-mini", user_id="alice")

# Enable streaming with stream=True
for chunk in client.chat(
    [{"role": "user", "content": "Tell me about machine learning"}],
    stream=True
):
    print(chunk, end="", flush=True)

print()  # Newline after completion
```

```python
from memlayer.wrappers.claude import Claude

client = Claude(model="claude-3-5-sonnet-20241022", user_id="alice")

for chunk in client.chat(
    [{"role": "user", "content": "Explain quantum computing"}],
    stream=True
):
    print(chunk, end="", flush=True)
```

```python
from memlayer.wrappers.gemini import Gemini

client = Gemini(model="gemini-2.5-flash", user_id="alice")

for chunk in client.chat(
    [{"role": "user", "content": "What is a neural network?"}],
    stream=True
):
    print(chunk, end="", flush=True)
```

```python
from memlayer.wrappers.ollama import Ollama

client = Ollama(model="llama3.2", user_id="alice")

for chunk in client.chat(
    [{"role": "user", "content": "Describe photosynthesis"}],
    stream=True
):
    print(chunk, end="", flush=True)
```

When using memory search, the LLM may call the `search_memory` tool before streaming:
```python
# Message that triggers memory search
for chunk in client.chat([
    {"role": "user", "content": "What did I tell you about my project?"}
], stream=True):
    print(chunk, end="", flush=True)

# Timeline:
# t=0ms:     Message sent
# t=50ms:    LLM calls search_memory tool
# t=250ms:   Search completes, context retrieved
# t=1500ms:  First chunk arrives (including search time)
# t=1500ms+: Chunks stream in real-time
```

After streaming completes, Memlayer extracts knowledge in a background thread:
```python
for chunk in client.chat([
    {"role": "user", "content": "I work at Acme Corp"}
], stream=True):
    print(chunk, end="", flush=True)

# Your code continues immediately after streaming
print("Streaming done!")  # This prints right away

# Meanwhile, in background:
# - Salience check (~100ms, cached after first run)
# - Knowledge extraction API call (1-2s, fast model)
# - Graph update (~50ms)
```

Next steps:

- Overview: Understand the full architecture
- Quickstart: Get started in 5 minutes
- Operation Modes: Choose the right mode
- Examples: See complete working code
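As a footnote on the background extraction described above: it follows the standard fire-and-forget daemon-thread pattern, which is why your code continues the moment streaming ends. This standalone sketch uses only the standard library (no Memlayer APIs) to show the ordering:

```python
import threading
import time

events = []

def extract_knowledge(conversation):
    # Stand-in for a background pipeline:
    # salience check -> extraction call -> graph update
    time.sleep(0.1)
    events.append(f"extracted from: {conversation}")

# Fire-and-forget: start extraction without blocking the caller
worker = threading.Thread(
    target=extract_knowledge,
    args=("I work at Acme Corp",),
    daemon=True,
)
worker.start()

events.append("Streaming done!")  # The caller continues immediately
worker.join()  # Demo only: wait so the process doesn't exit early
print(events)
```

The main thread records its event well before the worker's 100 ms sleep elapses, so "Streaming done!" always lands first.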