feat: Implement relevant metrics involving input image token count #667
Conversation
Try out this PR

Quick install:

```
pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c95a0d45483c60dac56514631b6e373015a8aa2c
```

Recommended with a virtual environment (using uv):

```
uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c95a0d45483c60dac56514631b6e373015a8aa2c
```

Last updated for commit: c95a0d4
Codecov Report: ❌ Patch coverage is
Walkthrough

Adds client/server hybrid token counting for multimodal requests (images) and introduces per-record image and text input token counts.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (4 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/aiperf/records/inference_result_parser.py (1)
386-435: ⚠️ Potential issue | 🟠 Major: Fix hybrid counting for image-only prompts and honor server output counts when requested.
Two edge cases here:
- For image-only requests, `text_input_token_count` becomes `None`, so server prompt tokens are ignored and a misleading "missing usage" warning is logged. This drops image-token metrics even when `usage.prompt_tokens` exists.
- When `--use-server-token-count` is enabled, this path still uses client-side output/reasoning tokenization, which breaks the flag's semantics and can skew metrics for users relying on server counts.

Consider treating missing text as `0` when images are present and preferring server output/reasoning counts when the flag is set (falling back to client counts if server usage is missing).

🐛 Suggested fix for hybrid counting edge cases
```diff
         output_texts, reasoning_texts = self._parse_output_and_reasoning_texts(
             responses
         )
         output_token_count = self._compute_token_count(tokenizer, output_texts)
         reasoning_token_count = self._compute_token_count(tokenizer, reasoning_texts)
+        if self.user_config.endpoint.use_server_token_count:
+            server_reasoning = self._extract_server_reasoning_token_count(responses)
+            server_output = self._extract_server_output_token_count(
+                responses, server_reasoning
+            )
+            if server_reasoning is not None:
+                reasoning_token_count = server_reasoning
+            if server_output is not None:
+                output_token_count = server_output
         if self._request_has_images(request_record):
             server_prompt_tokens = self._extract_server_input_token_count(responses)
-            if server_prompt_tokens is not None and text_input_token_count is not None:
-                image_input_token_count = server_prompt_tokens - text_input_token_count
+            if server_prompt_tokens is not None:
+                effective_text_input = text_input_token_count or 0
+                image_input_token_count = server_prompt_tokens - effective_text_input
                 if image_input_token_count < 0:
                     self.warning(
                         f"Server reported fewer prompt tokens ({server_prompt_tokens}) than "
                         f"client-side text token count ({text_input_token_count}). "
                         "Clamping image input tokens to 0."
                     )
                     image_input_token_count = 0
                 return TokenCounts(
                     input=server_prompt_tokens,
-                    text_input=text_input_token_count,
+                    text_input=effective_text_input,
                     image_input=image_input_token_count,
                     reasoning=reasoning_token_count,
                     output=output_token_count,
                 )
-            else:
-                self.warning(
-                    "Images detected in input but server did not report usage.prompt_tokens. "
-                    "ISL will reflect text tokens only; image token count is unknown."
-                )
+            self.warning(
+                "Images detected in input but server did not report usage.prompt_tokens. "
+                "ISL will reflect text tokens only; image token count is unknown."
+            )
```
🤖 Fix all issues with AI agents
In `@tests/unit/metrics/test_image_token_metrics.py`:
- Around line 98-165: Rename the test methods in class
TestImageInputTokenCountMetric to follow test_<function>_<scenario>_<expected>;
e.g., rename test_reads_image_input_from_token_counts to
test_image_input_token_count_metric_reads_image_input_tokens_returns_count,
test_various_image_token_counts to
test_image_input_token_count_metric_various_counts_return_same,
test_multiple_records to
test_image_input_token_count_metric_multiple_records_preserves_order,
test_no_image_input_raises to
test_image_input_token_count_metric_no_image_input_raises_NoMetricValue,
test_no_token_counts_raises to
test_image_input_token_count_metric_no_token_counts_raises_NoMetricValue, and
test_zero_image_tokens to
test_image_input_token_count_metric_zero_image_tokens_returns_zero so the names
reference ImageInputTokenCountMetric / run_image_token_pipeline and match the
mandated pattern.
In `@tests/unit/records/test_inference_result_parser_image_tokens.py`:
- Around line 94-151: The test methods in TestRequestHasImages do not follow the
required naming convention test_<function>_<scenario>_<expected>; rename each
async test method to reflect the function _request_has_images, the scenario, and
the expected result (e.g., test_request_has_images_detects_images_true for
test_detects_images, test_request_has_images_no_images_false for test_no_images,
test_request_has_images_empty_image_contents_false,
test_request_has_images_no_turns_false, and
test_request_has_images_multiple_turns_with_images_true) so the class
TestRequestHasImages and its methods (e.g., test_detects_images, test_no_images,
test_empty_image_contents, test_no_turns, test_multiple_turns_with_images) match
the repo rule.
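For illustration, here is a minimal sketch of the `test_<function>_<scenario>_<expected>` naming pattern both prompts ask for. The helper function below is a hypothetical stand-in, not the real metric class or fixtures from the test files:

```python
# Illustrative sketch of the naming convention only; the helper is a stand-in,
# and the real tests would exercise ImageInputTokenCountMetric / NoMetricValue.
import pytest


def image_input_token_count(token_counts: dict) -> int:
    """Stand-in for the metric's per-record extraction; raises when the field is absent."""
    value = token_counts.get("image_input")
    if value is None:
        raise ValueError("no image input tokens")  # the real metric raises NoMetricValue
    return value


class TestImageInputTokenCountMetric:
    def test_image_input_token_count_metric_reads_image_input_tokens_returns_count(self):
        assert image_input_token_count({"image_input": 576}) == 576

    def test_image_input_token_count_metric_zero_image_tokens_returns_zero(self):
        assert image_input_token_count({"image_input": 0}) == 0

    def test_image_input_token_count_metric_no_image_input_raises_NoMetricValue(self):
        with pytest.raises(ValueError):
            image_input_token_count({})
```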
206333d to 88c430e (Compare)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 407-430: The current else lumps two failure modes together; change
the logic in inference_result_parser.py around _request_has_images(),
_extract_server_input_token_count(), and TokenCounts so you handle them
separately: (1) if server_prompt_tokens is not None but text_input_token_count
is None, log a specific warning that client tokenization failed and return a
partial TokenCounts with input set to server_prompt_tokens and other per-field
values set to sensible partial/unknown values (e.g., text_input=0 or None and
image_input left unknown/None) so callers get the server-reported total; (2) if
server_prompt_tokens is None (regardless of text tokenization), keep the
existing warning that server did not report usage.prompt_tokens and do not
return server-derived totals. Use the existing symbols
(_extract_server_input_token_count, text_input_token_count, TokenCounts, and
self.warning) to find where to implement the split.
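A rough, self-contained sketch of how the two failure modes could be separated is below. The function name, the tuple return shape, and the simplified `TokenCounts` dataclass are hypothetical illustrations, not the actual code in `inference_result_parser.py`:

```python
# Hypothetical sketch: separates the two prompt-token failure modes described above.
# TokenCounts here is a simplified stand-in for the real dataclass in the repo.
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenCounts:
    input: Optional[int] = None
    text_input: Optional[int] = None
    image_input: Optional[int] = None


def split_prompt_token_failure_modes(
    server_prompt_tokens: Optional[int],
    text_input_token_count: Optional[int],
) -> tuple[Optional[TokenCounts], Optional[str]]:
    """Distinguish 'client tokenization failed' from 'server usage missing'.

    Returns (partial_counts, warning); partial_counts is only populated for case 1,
    where the server-reported total is still worth surfacing.
    """
    if server_prompt_tokens is not None and text_input_token_count is None:
        # Case 1: server reported a prompt total, but client-side text tokenization failed.
        return (
            TokenCounts(input=server_prompt_tokens, text_input=None, image_input=None),
            "Client-side text tokenization failed; reporting server prompt total only.",
        )
    if server_prompt_tokens is None:
        # Case 2: no server usage; callers keep client-derived, text-only counts.
        return (
            None,
            "Server did not report usage.prompt_tokens; ISL reflects text tokens only.",
        )
    # Happy path: both values present; caller computes image_input = server - text.
    return (None, None)
```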
🧹 Nitpick comments (1)
src/aiperf/metrics/types/image_token_metrics.py (1)
132-165: Is `LARGER_IS_BETTER` on `ImageTokenRatioMetric` intentional? A higher image-to-total token ratio isn't inherently "better"; it's a descriptive metric indicating composition. This flag typically guides optimization direction in dashboards. Worth confirming this is the intended display behavior, though it has no functional impact.
88c430e to 3c4d82a (Compare)
3c4d82a to c95a0d4 (Compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/metrics_reference.md (1)
459-461: ⚠️ Potential issue | 🟡 Minor: Section note may mislead users about the new image-token metrics.
The existing note says these metrics require "image-capable endpoints (e.g., image generation APIs)." However, the six new image-token metrics (lines 494–594) are for image input scenarios (vision-language models receiving images as input), not image generation. The existing Image Throughput/Latency metrics are for generation endpoints, so the note is only accurate for those two.
Consider splitting the section or updating the note to clarify that the image-token metrics apply to endpoints that accept images as input (e.g., VLMs), while Image Throughput/Latency apply to image-generation endpoints.
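One possible rewording of that note is sketched below. The exact wording and the metric names (inferred from the test classes mentioned elsewhere in this review) are only a suggestion, not the actual text of `docs/metrics_reference.md`:

```markdown
> **Note:** Image Throughput and Image Latency apply to image-generation endpoints.
> The image-token metrics below (e.g., Image Input Token Count, Text Input Token Count,
> Tokens per Image, Image Token Ratio) apply to endpoints that accept images as input,
> such as vision-language models, and rely on the server reporting `usage.prompt_tokens`.
```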
🤖 Fix all issues with AI agents
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 216-230: The info log about hybrid token counting is triggered
only when user_config.endpoint.use_server_token_count is true, but it should run
when the code silently augments client-side tokenization (i.e., when
use_server_token_count is false) so users requesting client-side counts see the
one-time notice; update the conditional in the method that contains
_endpoint_supports_text_tokenization and _compute_client_side_token_counts so
the info() message is logged when has_images and
self._endpoint_supports_text_tokenization and not self._image_hybrid_info_logged
and NOT self.user_config.endpoint.use_server_token_count (and still set
self._image_hybrid_info_logged = True), and adjust the log text to clearly state
that client-side tokenization is being supplemented by server-reported totals
for output/reasoning while image token counts are derived client-side; keep the
existing branch for the explicit use_server_token_count case unchanged or
provide a distinct message if desired.
🧹 Nitpick comments (1)
tests/unit/metrics/test_image_token_metrics.py (1)
171-178: Test names in remaining classes don't follow the mandated naming convention.
`TestTextInputTokenCountMetric`, `TestTokensPerImageMetric`, `TestImageTokenRatioMetric`, `TestTotalImageInputTokensMetric`, `TestImageInputTokenThroughputMetric`, and `TestImageTokenMetricsIntegration` contain test methods that don't follow the `test_<function>_<scenario>_<expected>` pattern. For example, `test_reads_text_input_from_token_counts` should be something like `test_text_input_token_count_metric_reads_text_input_returns_count`.

The naming was fixed for `TestImageInputTokenCountMetric`, but the same pattern wasn't applied to the other test classes.

As per coding guidelines: "Name tests as `test_<function>_<scenario>_<expected>` (e.g., `test_parse_config_missing_field_raises_error`)."
```python
if has_images and self._endpoint_supports_text_tokenization:
    if (
        self.user_config.endpoint.use_server_token_count
        and not self._image_hybrid_info_logged
    ):
        self.info(
            "Non-text input modalities detected (images). Input token counting "
            "will be supplemented with client-side text tokenization to derive "
            "image token counts. Output and reasoning token counts will continue "
            "to use server-reported values."
        )
        self._image_hybrid_info_logged = True
    token_counts = await self._compute_client_side_token_counts(
        request_record, resp
    )
```
Info log fires in the wrong case; the automatic hybrid switch is silent.
The one-time info message (lines 221–227) only triggers when `use_server_token_count` is already True. But that's the case where the user explicitly asked for server counts; the hybrid augmentation is a minor tweak to their expectation.
The more important case is when `use_server_token_count` is False (the default): the user expects pure client-side tokenization, but the code silently switches ISL to the server-reported total. A one-time info log here would help users understand why their ISL values differ from pure client tokenization.
Proposed fix
```diff
         if has_images and self._endpoint_supports_text_tokenization:
             if (
                 self.user_config.endpoint.use_server_token_count
                 and not self._image_hybrid_info_logged
             ):
                 self.info(
                     "Non-text input modalities detected (images). Input token counting "
                     "will be supplemented with client-side text tokenization to derive "
                     "image token counts. Output and reasoning token counts will continue "
                     "to use server-reported values."
                 )
                 self._image_hybrid_info_logged = True
+            elif (
+                not self.user_config.endpoint.use_server_token_count
+                and not self._image_hybrid_info_logged
+            ):
+                self.info(
+                    "Non-text input modalities detected (images). ISL will use "
+                    "server-reported usage.prompt_tokens instead of client-side "
+                    "tokenization to include image tokens. Per-modality breakdowns "
+                    "are available via image_input_token_count and text_input_token_count."
+                )
+                self._image_hybrid_info_logged = True
             token_counts = await self._compute_client_side_token_counts(
                 request_record, resp
             )
```

🤖 Prompt for AI Agents
In `@src/aiperf/records/inference_result_parser.py` around lines 216-230: The
info log about hybrid token counting is triggered only when
user_config.endpoint.use_server_token_count is true, but it should run when the
code silently augments client-side tokenization (i.e., when
use_server_token_count is false) so users requesting client-side counts see the
one-time notice; update the conditional in the method that contains
_endpoint_supports_text_tokenization and _compute_client_side_token_counts so
the info() message is logged when has_images and
self._endpoint_supports_text_tokenization and not self._image_hybrid_info_logged
and NOT self.user_config.endpoint.use_server_token_count (and still set
self._image_hybrid_info_logged = True), and adjust the log text to clearly state
that client-side tokenization is being supplemented by server-reported totals
for output/reasoning while image token counts are derived client-side; keep the
existing branch for the explicit use_server_token_count case unchanged or
provide a distinct message if desired.
There is a pending discussion arguing in favor of always using server token counts.
Creating this thread as a hold for the PR.
- `--use-server-token-count` is automatically supplemented when images are detected; no user action needed
- (`None` otherwise)

Summary by CodeRabbit
New Features
Documentation
Tests