feat: support multimodal tool outputs (text + image) #4955

tpirc3 wants to merge 4 commits into livekit:main
Conversation
```python
def _to_chat_tool_output_content(output: Any) -> str | list[dict[str, Any]]:
    text_parts, image_parts = llm.utils.split_tool_output_parts(output)
    if not image_parts:
        return llm.utils.tool_output_to_text(output, include_image_placeholder=False)

    parts: list[dict[str, Any]] = [{"type": "text", "text": text} for text in text_parts]
    for image in image_parts:
        try:
            parts.append(_to_image_content(image))
        except ValueError as e:
            logger.warning(
                "Failed to serialize tool output image for openai chat format", exc_info=e
            )
            parts.append({"type": "text", "text": llm.utils.TOOL_OUTPUT_IMAGE_PLACEHOLDER})

    return parts
```
🟡 _to_chat_tool_output_content loses interleaving order of text and image parts
When `supports_tool_image_output=True` is enabled, `_to_chat_tool_output_content` uses `split_tool_output_parts`, which separates text and image parts into two independent lists, then reconstructs the content with all text parts first followed by all image parts. This destroys the original interleaving order.
Root Cause and Impact
`split_tool_output_parts` at `livekit-agents/livekit/agents/llm/utils.py:528-538` collects text into one list and images into another. Then `_to_chat_tool_output_content` builds the output as `[*text_parts, *image_parts]`:

```python
parts: list[dict[str, Any]] = [{"type": "text", "text": text} for text in text_parts]
for image in image_parts:
    parts.append(_to_image_content(image))
```

For an output like `["before", image1, "middle", image2, "after"]`, the result is:

`[text("before"), text("middle"), text("after"), image1, image2]`

instead of the expected:

`[text("before"), image1, text("middle"), image2, text("after")]`
This contrasts with the AWS and Anthropic providers, which correctly use tool_output_parts (preserving order) to iterate and build content parts sequentially. The LLM receiving misordered content may misinterpret which text describes which image.
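The ordering loss is easy to reproduce in plain Python. The sketch below uses a hypothetical `Image` dataclass as a stand-in for `ImageContent` and two toy rebuild functions that mirror the two strategies (they are not the livekit helpers themselves):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """Hypothetical stand-in for ImageContent."""
    name: str


def rebuild_split(output):
    # Mirrors the buggy path: split into two lists, then concatenate,
    # so every text part comes before every image part.
    texts = [p for p in output if isinstance(p, str)]
    images = [p for p in output if isinstance(p, Image)]
    return texts + images


def rebuild_in_order(output):
    # Mirrors the order-preserving approach: a single pass over the parts.
    return list(output)


output = ["before", Image("image1"), "middle", Image("image2"), "after"]
print(rebuild_split(output))
# ['before', 'middle', 'after', Image(name='image1'), Image(name='image2')]
print(rebuild_in_order(output))
# ['before', Image(name='image1'), 'middle', Image(name='image2'), 'after']
```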
Suggested change:

```diff
 def _to_chat_tool_output_content(output: Any) -> str | list[dict[str, Any]]:
     text_parts, image_parts = llm.utils.split_tool_output_parts(output)
     if not image_parts:
         return llm.utils.tool_output_to_text(output, include_image_placeholder=False)
-    parts: list[dict[str, Any]] = [{"type": "text", "text": text} for text in text_parts]
-    for image in image_parts:
-        try:
-            parts.append(_to_image_content(image))
-        except ValueError as e:
-            logger.warning(
-                "Failed to serialize tool output image for openai chat format", exc_info=e
-            )
-            parts.append({"type": "text", "text": llm.utils.TOOL_OUTPUT_IMAGE_PLACEHOLDER})
+    parts: list[dict[str, Any]] = []
+    for part in llm.utils.tool_output_parts(output):
+        if isinstance(part, str):
+            parts.append({"type": "text", "text": part})
+        else:
+            try:
+                parts.append(_to_image_content(part))
+            except ValueError as e:
+                logger.warning(
+                    "Failed to serialize tool output image for openai chat format", exc_info=e
+                )
+                parts.append({"type": "text", "text": llm.utils.TOOL_OUTPUT_IMAGE_PLACEHOLDER})
     return parts
```
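The order-preserving pattern can be demonstrated end to end with a self-contained mock. Here `Image`, `to_image_content`, and `IMAGE_PLACEHOLDER` are hypothetical stand-ins for `ImageContent`, `_to_image_content`, and `TOOL_OUTPUT_IMAGE_PLACEHOLDER`; the serialization loop mirrors the suggested fix, including the placeholder fallback when an image fails to serialize:

```python
import base64
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """Hypothetical stand-in for ImageContent."""
    data: bytes


IMAGE_PLACEHOLDER = "[image]"  # stand-in for TOOL_OUTPUT_IMAGE_PLACEHOLDER


def to_image_content(image: Image) -> dict:
    # Mock of _to_image_content: build an OpenAI-style image_url part
    # from raw bytes; raise ValueError for unserializable (empty) images.
    if not image.data:
        raise ValueError("empty image")
    b64 = base64.b64encode(image.data).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}


def to_chat_content(output: list) -> list[dict]:
    # One pass over the parts, mirroring the suggested fix, so text and
    # images stay interleaved exactly as the tool returned them.
    parts: list[dict] = []
    for part in output:
        if isinstance(part, str):
            parts.append({"type": "text", "text": part})
        else:
            try:
                parts.append(to_image_content(part))
            except ValueError:
                parts.append({"type": "text", "text": IMAGE_PLACEHOLDER})
    return parts


parts = to_chat_content(["caption", Image(b"\x89PNG"), "done"])
print([p["type"] for p in parts])  # ['text', 'image_url', 'text']
```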
Sorry for the confusion with these two PRs.
Summary
- Extend `FunctionCallOutput.output` to support `ImageContent` and `list[str | ImageContent]`
- Add `supports_tool_image_output` to support Qwen-compatible providers that accept tool image content

Why
Closes #4893
Tests
- `make check`
- `uv run pytest tests/test_tool_output_multimodal.py tests/test_chat_ctx.py tests/test_tools.py -q`