feat: support multimodal tool outputs (text + image) #4955

tpirc3 wants to merge 4 commits into livekit:main
Conversation
```python
def _to_chat_tool_output_content(output: Any) -> str | list[dict[str, Any]]:
    text_parts, image_parts = llm.utils.split_tool_output_parts(output)
    if not image_parts:
        return llm.utils.tool_output_to_text(output, include_image_placeholder=False)

    parts: list[dict[str, Any]] = [{"type": "text", "text": text} for text in text_parts]
    for image in image_parts:
        try:
            parts.append(_to_image_content(image))
        except ValueError as e:
            logger.warning(
                "Failed to serialize tool output image for openai chat format", exc_info=e
            )
            parts.append({"type": "text", "text": llm.utils.TOOL_OUTPUT_IMAGE_PLACEHOLDER})

    return parts
```
🟡 _to_chat_tool_output_content loses interleaving order of text and image parts
When `supports_tool_image_output=True` is enabled, `_to_chat_tool_output_content` uses `split_tool_output_parts`, which separates text and image parts into two independent lists, then reconstructs the content with all text parts first followed by all image parts. This destroys the original interleaving order.
Root Cause and Impact
`split_tool_output_parts` at `livekit-agents/livekit/agents/llm/utils.py:528-538` collects text into one list and images into another. Then `_to_chat_tool_output_content` builds the output as `[*text_parts, *image_parts]`:

```python
parts: list[dict[str, Any]] = [{"type": "text", "text": text} for text in text_parts]
for image in image_parts:
    parts.append(_to_image_content(image))
```

For an output like `["before", image1, "middle", image2, "after"]`, the result is:

`[text("before"), text("middle"), text("after"), image1, image2]`

instead of the expected:

`[text("before"), image1, text("middle"), image2, text("after")]`
This contrasts with the AWS and Anthropic providers, which correctly use tool_output_parts (preserving order) to iterate and build content parts sequentially. The LLM receiving misordered content may misinterpret which text describes which image.
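The ordering loss is easy to reproduce in plain Python. The sketch below uses a hypothetical `Image` dataclass as a stand-in for `ImageContent` and two toy rebuild functions that mirror the two strategies (they are not the livekit helpers themselves):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """Hypothetical stand-in for ImageContent."""
    name: str


def rebuild_split(output):
    # Mirrors the buggy path: split into two lists, then concatenate,
    # so every text part comes before every image part.
    texts = [p for p in output if isinstance(p, str)]
    images = [p for p in output if isinstance(p, Image)]
    return texts + images


def rebuild_in_order(output):
    # Mirrors the order-preserving approach: a single pass over the parts.
    return list(output)


output = ["before", Image("image1"), "middle", Image("image2"), "after"]
print(rebuild_split(output))
# ['before', 'middle', 'after', Image(name='image1'), Image(name='image2')]
print(rebuild_in_order(output))
# ['before', Image(name='image1'), 'middle', Image(name='image2'), 'after']
```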
Suggested change:

```diff
 def _to_chat_tool_output_content(output: Any) -> str | list[dict[str, Any]]:
     text_parts, image_parts = llm.utils.split_tool_output_parts(output)
     if not image_parts:
         return llm.utils.tool_output_to_text(output, include_image_placeholder=False)
-    parts: list[dict[str, Any]] = [{"type": "text", "text": text} for text in text_parts]
-    for image in image_parts:
-        try:
-            parts.append(_to_image_content(image))
-        except ValueError as e:
-            logger.warning(
-                "Failed to serialize tool output image for openai chat format", exc_info=e
-            )
-            parts.append({"type": "text", "text": llm.utils.TOOL_OUTPUT_IMAGE_PLACEHOLDER})
+    parts: list[dict[str, Any]] = []
+    for part in llm.utils.tool_output_parts(output):
+        if isinstance(part, str):
+            parts.append({"type": "text", "text": part})
+        else:
+            try:
+                parts.append(_to_image_content(part))
+            except ValueError as e:
+                logger.warning(
+                    "Failed to serialize tool output image for openai chat format", exc_info=e
+                )
+                parts.append({"type": "text", "text": llm.utils.TOOL_OUTPUT_IMAGE_PLACEHOLDER})
     return parts
```
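The order-preserving pattern can be demonstrated end to end with a self-contained mock. Here `Image`, `to_image_content`, and `IMAGE_PLACEHOLDER` are hypothetical stand-ins for `ImageContent`, `_to_image_content`, and `TOOL_OUTPUT_IMAGE_PLACEHOLDER`; the serialization loop mirrors the suggested fix, including the placeholder fallback when an image fails to serialize:

```python
import base64
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """Hypothetical stand-in for ImageContent."""
    data: bytes


IMAGE_PLACEHOLDER = "[image]"  # stand-in for TOOL_OUTPUT_IMAGE_PLACEHOLDER


def to_image_content(image: Image) -> dict:
    # Mock of _to_image_content: build an OpenAI-style image_url part
    # from raw bytes; raise ValueError for unserializable (empty) images.
    if not image.data:
        raise ValueError("empty image")
    b64 = base64.b64encode(image.data).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}


def to_chat_content(output: list) -> list[dict]:
    # One pass over the parts, mirroring the suggested fix, so text and
    # images stay interleaved exactly as the tool returned them.
    parts: list[dict] = []
    for part in output:
        if isinstance(part, str):
            parts.append({"type": "text", "text": part})
        else:
            try:
                parts.append(to_image_content(part))
            except ValueError:
                parts.append({"type": "text", "text": IMAGE_PLACEHOLDER})
    return parts


parts = to_chat_content(["caption", Image(b"\x89PNG"), "done"])
print([p["type"] for p in parts])  # ['text', 'image_url', 'text']
```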
Sorry for the confusion with these two PRs.
Summary
- Extend `FunctionCallOutput.output` to support `ImageContent` and `list[str | ImageContent]`
- Add `supports_tool_image_output` to support Qwen-compatible providers that accept tool image content

Why
Closes #4893
Tests
- `make check`
- `uv run pytest tests/test_tool_output_multimodal.py tests/test_chat_ctx.py tests/test_tools.py -q`