
feat: support multimodal tool outputs (text + image)#4955

Closed
tpirc3 wants to merge 4 commits into livekit:main from tpirc3:feature/multimodal-tool

Conversation


@tpirc3 tpirc3 commented Feb 26, 2026

Summary

  • extend FunctionCallOutput.output to support ImageContent and list[str | ImageContent]
  • add shared normalization/splitting/text-fallback helpers for tool outputs
  • implement provider-specific handling for tool-result images:
    • native: OpenAI Responses, Anthropic, Google, AWS standard LLM path
    • fallback to text placeholder for unsupported paths (OpenAI chat default, Mistral, realtime variants)
  • add optional OpenAI chat-completions flag supports_tool_image_output to support Qwen-compatible providers that accept tool image content
  • fix telemetry/span/type assumptions by converting multimodal outputs to text where string-only sinks are required
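
The normalization and text-fallback behavior described in these bullets can be sketched with stand-in types. This is an illustrative sketch only, not the actual livekit API: `ImageContent`, `IMAGE_PLACEHOLDER`, and `tool_output_to_text` here are hypothetical stand-ins for the shared helpers the PR adds.

```python
from dataclasses import dataclass


@dataclass
class ImageContent:
    """Stand-in for livekit's ImageContent type."""
    url: str


# Hypothetical placeholder emitted when a provider cannot accept tool images.
IMAGE_PLACEHOLDER = "[image]"


def tool_output_to_text(output, *, include_image_placeholder=True):
    """Collapse a multimodal tool output into plain text for string-only
    sinks (e.g. telemetry spans, providers without tool-image support)."""
    parts = output if isinstance(output, list) else [output]
    texts = []
    for part in parts:
        if isinstance(part, str):
            texts.append(part)
        elif include_image_placeholder:
            texts.append(IMAGE_PLACEHOLDER)
    return "\n".join(texts)


output = ["Here is the chart:", ImageContent(url="https://example.com/chart.png")]
print(tool_output_to_text(output))
# Here is the chart:
# [image]
```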

Why

Closes #4893

Tests

  • make check
  • uv run pytest tests/test_tool_output_multimodal.py tests/test_chat_ctx.py tests/test_tools.py -q


CLAassistant commented Feb 26, 2026

CLA assistant check
All committers have signed the CLA.


@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 potential issue.

View 6 additional findings in Devin Review.


Comment on lines +139 to +154
```python
def _to_chat_tool_output_content(output: Any) -> str | list[dict[str, Any]]:
    text_parts, image_parts = llm.utils.split_tool_output_parts(output)
    if not image_parts:
        return llm.utils.tool_output_to_text(output, include_image_placeholder=False)

    parts: list[dict[str, Any]] = [{"type": "text", "text": text} for text in text_parts]
    for image in image_parts:
        try:
            parts.append(_to_image_content(image))
        except ValueError as e:
            logger.warning(
                "Failed to serialize tool output image for openai chat format", exc_info=e
            )
            parts.append({"type": "text", "text": llm.utils.TOOL_OUTPUT_IMAGE_PLACEHOLDER})

    return parts
```


🟡 _to_chat_tool_output_content loses interleaving order of text and image parts

When supports_tool_image_output=True is enabled, the _to_chat_tool_output_content function uses split_tool_output_parts which separates text and image parts into two independent lists, then reconstructs the content with all text parts first followed by all image parts. This destroys the original interleaving order.

Root Cause and Impact

split_tool_output_parts at livekit-agents/livekit/agents/llm/utils.py:528-538 collects text into one list and images into another. Then _to_chat_tool_output_content builds the output as all text parts followed by all image parts:

```python
parts: list[dict[str, Any]] = [{"type": "text", "text": text} for text in text_parts]
for image in image_parts:
    parts.append(_to_image_content(image))
```

For an output like `["before", image1, "middle", image2, "after"]`, the result is
`[text("before"), text("middle"), text("after"), image1, image2]`
instead of the expected
`[text("before"), image1, text("middle"), image2, text("after")]`.

This contrasts with the AWS and Anthropic providers, which correctly use tool_output_parts (preserving order) to iterate and build content parts sequentially. The LLM receiving misordered content may misinterpret which text describes which image.
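
The order loss can be demonstrated with a minimal, self-contained sketch. The `Image` stand-in type and the `rebuild_split` / `rebuild_ordered` helpers below are hypothetical, not the livekit code; they only model the two iteration strategies being compared.

```python
from dataclasses import dataclass


@dataclass
class Image:
    """Stand-in for an image part in a mixed tool output."""
    name: str


def rebuild_split(output):
    """Split-then-concatenate strategy: all text parts first, then all images."""
    texts = [p for p in output if isinstance(p, str)]
    images = [p for p in output if isinstance(p, Image)]
    return [{"type": "text", "text": t} for t in texts] + [
        {"type": "image", "name": i.name} for i in images
    ]


def rebuild_ordered(output):
    """Single ordered pass: each part is emitted in its original position."""
    parts = []
    for p in output:
        if isinstance(p, str):
            parts.append({"type": "text", "text": p})
        else:
            parts.append({"type": "image", "name": p.name})
    return parts


output = ["before", Image("img1"), "middle", Image("img2"), "after"]
print([p.get("text", p.get("name")) for p in rebuild_split(output)])
# ['before', 'middle', 'after', 'img1', 'img2']  -- interleaving lost
print([p.get("text", p.get("name")) for p in rebuild_ordered(output)])
# ['before', 'img1', 'middle', 'img2', 'after']  -- interleaving preserved
```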

Suggested change

```diff
 def _to_chat_tool_output_content(output: Any) -> str | list[dict[str, Any]]:
     text_parts, image_parts = llm.utils.split_tool_output_parts(output)
     if not image_parts:
         return llm.utils.tool_output_to_text(output, include_image_placeholder=False)
-    parts: list[dict[str, Any]] = [{"type": "text", "text": text} for text in text_parts]
-    for image in image_parts:
-        try:
-            parts.append(_to_image_content(image))
-        except ValueError as e:
-            logger.warning(
-                "Failed to serialize tool output image for openai chat format", exc_info=e
-            )
-            parts.append({"type": "text", "text": llm.utils.TOOL_OUTPUT_IMAGE_PLACEHOLDER})
-    return parts
+    parts: list[dict[str, Any]] = []
+    for part in llm.utils.tool_output_parts(output):
+        if isinstance(part, str):
+            parts.append({"type": "text", "text": part})
+        else:
+            try:
+                parts.append(_to_image_content(part))
+            except ValueError as e:
+                logger.warning(
+                    "Failed to serialize tool output image for openai chat format", exc_info=e
+                )
+                parts.append({"type": "text", "text": llm.utils.TOOL_OUTPUT_IMAGE_PLACEHOLDER})
+    return parts
```



tpirc3 commented Feb 26, 2026

Sorry for the confusion with these two PRs.

@tpirc3 tpirc3 closed this Feb 26, 2026


Development

Successfully merging this pull request may close these issues.

feat: support ImageContent in tool return value

2 participants