
feat: Implement relevant metrics involving input image token count #667

Open

matthewkotila wants to merge 2 commits into main from
mkotila/aip-725-implement-relevant-metrics-involving-input-image-token-count

Conversation

@matthewkotila (Contributor) commented Feb 11, 2026

  • When images are in the input, aiperf now counts image tokens separately from text tokens
    • It asks the server "how many total input tokens?" and the local tokenizer "how many text tokens?", then subtracts to get the image token count (see the sketch after this list)
  • Six new metrics for image token analysis:
    • per-request image and text input token counts (two metrics; default file-only)
    • tokens per image (default file-only)
    • image-to-total token ratio (default file-only)
    • total image tokens across the benchmark (default file-only)
    • image input token throughput (tokens/sec) (default CONSOLE)
  • The behavior behind --use-server-token-count is applied automatically when images are detected — no user action needed
    • aiperf detects images and switches to the hybrid counting approach on its own.
  • New fields on TokenCounts: text_input and image_input (only populated when images are present; None otherwise).
  • Wording is future-proofed for other non-text modalities (e.g., audio, video) so the logic and docs won't need rewriting when those are added.
  • 45+ new unit tests covering the hybrid token counting, all six metrics, edge cases (clamping, missing usage, streaming), and integration scenarios.
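
A minimal sketch of the subtraction described above, with the clamping behavior mentioned in the review below (the function and variable names are illustrative, not the actual aiperf implementation):

```python
def derive_image_input_tokens(server_prompt_tokens: int, client_text_tokens: int) -> int:
    """Image input tokens = server-reported prompt total minus client-tokenized text tokens."""
    image_tokens = server_prompt_tokens - client_text_tokens
    if image_tokens < 0:
        # The server reported fewer prompt tokens than the client counted for text alone;
        # clamp to zero rather than reporting a negative image token count.
        image_tokens = 0
    return image_tokens


# Example: the server reports 1200 total prompt tokens and the local tokenizer counts
# 450 text tokens, so 750 tokens are attributed to the image content.
assert derive_image_input_tokens(1200, 450) == 750
```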

Summary by CodeRabbit

  • New Features

    • Added image token metrics: image input tokens, text input tokens, tokens per image, image token ratio, total image input tokens, and image input token throughput.
    • Parser now supports hybrid image-aware token counting and reports a single informational message when that mode is used.
  • Documentation

    • Clarified multimodal token-counting across CLI, metrics reference, and metric descriptions; explains server vs. client token sources and ISL behavior with images.
  • Tests

    • Added unit tests for image token metrics and image-aware parsing behaviors.

@github-actions bot commented Feb 11, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c95a0d45483c60dac56514631b6e373015a8aa2f

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c95a0d45483c60dac56514631b6e373015a8aa2f

Last updated for commit: c95a0d4

@codecov bot commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 98.26087% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/aiperf/metrics/types/image_token_metrics.py | 97.56% | 1 Missing and 1 partial ⚠️ |


@coderabbitai bot commented Feb 11, 2026

Walkthrough

Adds client/server hybrid token counting for multimodal requests (images), introduces per-record text_input and image_input fields, six new image-token metrics (counts, ratios, throughput), updates docs/config, and adds comprehensive unit tests for metrics and parser behavior.

Changes

Cohort / File(s) Summary
Documentation
docs/cli_options.md, docs/metrics_reference.md
Notes and TOC updates describing hybrid token-counting for non-text modalities; added six image-token metrics and clarified ISL and throughput semantics for multimodal input.
Config & Model docs
src/aiperf/common/config/endpoint_config.py, src/aiperf/common/models/record_models.py
Expanded use_server_token_count description; added `text_input: int | None` and `image_input: int | None` fields to TokenCounts (populated only when images are present).
Metrics implementation
src/aiperf/metrics/types/image_token_metrics.py, src/aiperf/metrics/types/input_sequence_length_metric.py, src/aiperf/metrics/types/total_token_throughput.py
New module with six image-token metrics (ImageInputTokenCount, TextInputTokenCount, TokensPerImage, ImageTokenRatio, TotalImageInputTokens, ImageInputTokenThroughput); updated ISL and throughput docstrings to reference multimodal counts. (A formula sketch follows this table.)
Parser logic
src/aiperf/records/inference_result_parser.py
Added image detection, endpoint capability flag, single-info-log guard, and hybrid counting: when images present and endpoint supports text tokenization, use server usage.prompt_tokens as total and derive image_input = prompt_tokens - client_text_tokens (clamped); includes warnings and fallbacks for missing server usage.
Tests
tests/unit/metrics/test_image_token_metrics.py, tests/unit/records/test_inference_result_parser_image_tokens.py
Comprehensive unit tests for per-record image/text token metrics, derived sums/ratios/throughput, parser hybrid behavior, edge cases (missing server usage, zero images/duration), multi-turn and streaming scenarios.
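
The derived metrics in the "Metrics implementation" row above reduce to simple per-benchmark formulas. A hedged sketch (the helper below and its key names are illustrative; the real implementation lives in the metric classes in image_token_metrics.py and aggregates per-record values):

```python
def image_token_metrics(
    image_tokens: int, text_tokens: int, num_images: int, benchmark_duration_s: float
) -> dict[str, float]:
    """Illustrative roll-ups for the new image-token metrics."""
    total_input = image_tokens + text_tokens
    return {
        # Average number of input tokens contributed by each image.
        "tokens_per_image": image_tokens / num_images if num_images else 0.0,
        # Fraction of all input tokens that came from images.
        "image_token_ratio": image_tokens / total_input if total_input else 0.0,
        # Image input tokens processed per second over the whole benchmark.
        "image_input_token_throughput": image_tokens / benchmark_duration_s,
    }


# Example: 750 image tokens and 450 text tokens across 3 images in a 10-second run.
print(image_token_metrics(750, 450, 3, 10.0))
# {'tokens_per_image': 250.0, 'image_token_ratio': 0.625, 'image_input_token_throughput': 75.0}
```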

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble tokens, count with care,
Text or image — split them fair.
Six new metrics, parser bright,
Tests ensure the math is right.
Hop on, report, and out of sight!

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main feature: implementing metrics for image token counting, which is the primary focus across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.68%, which is sufficient. The required threshold is 80.00%. |
| Merge Conflict Detection | ✅ Passed | No merge conflicts detected when merging into main. |


@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/aiperf/records/inference_result_parser.py (1)

386-435: ⚠️ Potential issue | 🟠 Major

Fix hybrid counting for image-only prompts and honor server output counts when requested.

Two edge cases here:

  • For image-only requests, text_input_token_count becomes None, so server prompt tokens are ignored and a misleading “missing usage” warning is logged. This drops image-token metrics even when usage.prompt_tokens exists.
  • When --use-server-token-count is enabled, this path still uses client-side output/reasoning tokenization, which breaks the flag’s semantics and can skew metrics for users relying on server counts.

Consider treating missing text as 0 when images are present and preferring server output/reasoning counts when the flag is set (fallback to client if server usage is missing).

🐛 Suggested fix for hybrid counting edge cases
         output_texts, reasoning_texts = self._parse_output_and_reasoning_texts(
             responses
         )
         output_token_count = self._compute_token_count(tokenizer, output_texts)
         reasoning_token_count = self._compute_token_count(tokenizer, reasoning_texts)
+        if self.user_config.endpoint.use_server_token_count:
+            server_reasoning = self._extract_server_reasoning_token_count(responses)
+            server_output = self._extract_server_output_token_count(
+                responses, server_reasoning
+            )
+            if server_reasoning is not None:
+                reasoning_token_count = server_reasoning
+            if server_output is not None:
+                output_token_count = server_output
 
         if self._request_has_images(request_record):
             server_prompt_tokens = self._extract_server_input_token_count(responses)
-            if server_prompt_tokens is not None and text_input_token_count is not None:
-                image_input_token_count = server_prompt_tokens - text_input_token_count
+            if server_prompt_tokens is not None:
+                effective_text_input = text_input_token_count or 0
+                image_input_token_count = server_prompt_tokens - effective_text_input
                 if image_input_token_count < 0:
                     self.warning(
                         f"Server reported fewer prompt tokens ({server_prompt_tokens}) than "
                         f"client-side text token count ({text_input_token_count}). "
                         "Clamping image input tokens to 0."
                     )
                     image_input_token_count = 0
                 return TokenCounts(
                     input=server_prompt_tokens,
-                    text_input=text_input_token_count,
+                    text_input=effective_text_input,
                     image_input=image_input_token_count,
                     reasoning=reasoning_token_count,
                     output=output_token_count,
                 )
-            else:
-                self.warning(
-                    "Images detected in input but server did not report usage.prompt_tokens. "
-                    "ISL will reflect text tokens only; image token count is unknown."
-                )
+            self.warning(
+                "Images detected in input but server did not report usage.prompt_tokens. "
+                "ISL will reflect text tokens only; image token count is unknown."
+            )
🤖 Fix all issues with AI agents
In `@tests/unit/metrics/test_image_token_metrics.py`:
- Around line 98-165: Rename the test methods in class
TestImageInputTokenCountMetric to follow test_<function>_<scenario>_<expected>;
e.g., rename test_reads_image_input_from_token_counts to
test_image_input_token_count_metric_reads_image_input_tokens_returns_count,
test_various_image_token_counts to
test_image_input_token_count_metric_various_counts_return_same,
test_multiple_records to
test_image_input_token_count_metric_multiple_records_preserves_order,
test_no_image_input_raises to
test_image_input_token_count_metric_no_image_input_raises_NoMetricValue,
test_no_token_counts_raises to
test_image_input_token_count_metric_no_token_counts_raises_NoMetricValue, and
test_zero_image_tokens to
test_image_input_token_count_metric_zero_image_tokens_returns_zero so the names
reference ImageInputTokenCountMetric / run_image_token_pipeline and match the
mandated pattern.

In `@tests/unit/records/test_inference_result_parser_image_tokens.py`:
- Around line 94-151: The test methods in TestRequestHasImages do not follow the
required naming convention test_<function>_<scenario>_<expected>; rename each
async test method to reflect the function _request_has_images, the scenario, and
the expected result (e.g., test_request_has_images_detects_images_true for
test_detects_images, test_request_has_images_no_images_false for test_no_images,
test_request_has_images_empty_image_contents_false,
test_request_has_images_no_turns_false, and
test_request_has_images_multiple_turns_with_images_true) so the class
TestRequestHasImages and its methods (e.g., test_detects_images, test_no_images,
test_empty_image_contents, test_no_turns, test_multiple_turns_with_images) match
the repo rule.
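
A hedged illustration of the requested pattern (the assertion bodies are placeholders, not the real tests in the PR; only the naming is the point):

```python
# Before: the scenario and the expected outcome are not encoded in the name.
def test_zero_image_tokens():
    assert max(1200 - 1200, 0) == 0


# After: test_<function>_<scenario>_<expected>, naming the metric under test.
def test_image_input_token_count_metric_zero_image_tokens_returns_zero():
    assert max(1200 - 1200, 0) == 0
```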

@matthewkotila force-pushed the mkotila/aip-725-implement-relevant-metrics-involving-input-image-token-count branch from 206333d to 88c430e on February 11, 2026 at 17:59
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 407-430: The current else lumps two failure modes together; change
the logic in inference_result_parser.py around _request_has_images(),
_extract_server_input_token_count(), and TokenCounts so you handle them
separately: (1) if server_prompt_tokens is not None but text_input_token_count
is None, log a specific warning that client tokenization failed and return a
partial TokenCounts with input set to server_prompt_tokens and other per-field
values set to sensible partial/unknown values (e.g., text_input=0 or None and
image_input left unknown/None) so callers get the server-reported total; (2) if
server_prompt_tokens is None (regardless of text tokenization), keep the
existing warning that server did not report usage.prompt_tokens and do not
return server-derived totals. Use the existing symbols
(_extract_server_input_token_count, text_input_token_count, TokenCounts, and
self.warning) to find where to implement the split.
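
A standalone sketch of the two-way split described above (the helper and its return labels are illustrative, not the parser's actual API):

```python
def classify_prompt_token_case(
    server_prompt_tokens: int | None, client_text_tokens: int | None
) -> str:
    """Decide how to report input tokens when images are present."""
    if server_prompt_tokens is not None and client_text_tokens is None:
        # Case 1: client-side text tokenization failed; still trust the server total,
        # but leave the image/text breakdown unknown.
        return "use_server_total_without_breakdown"
    if server_prompt_tokens is None:
        # Case 2: the server did not report usage.prompt_tokens; keep the existing
        # warning and fall back to text-only ISL with an unknown image token count.
        return "fall_back_to_client_text_only"
    # Normal hybrid path: derive image tokens as server total minus client text tokens.
    return "hybrid_breakdown_available"


assert classify_prompt_token_case(1200, None) == "use_server_total_without_breakdown"
assert classify_prompt_token_case(None, 450) == "fall_back_to_client_text_only"
assert classify_prompt_token_case(1200, 450) == "hybrid_breakdown_available"
```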
🧹 Nitpick comments (1)
src/aiperf/metrics/types/image_token_metrics.py (1)

132-165: LARGER_IS_BETTER on ImageTokenRatioMetric — intentional?

A higher image-to-total token ratio isn't inherently "better" — it's a descriptive metric indicating composition. This flag typically guides optimization direction in dashboards. Worth confirming this is the intended display behavior, though it has no functional impact.

@matthewkotila force-pushed the mkotila/aip-725-implement-relevant-metrics-involving-input-image-token-count branch from 88c430e to 3c4d82a on February 12, 2026 at 02:12
@matthewkotila force-pushed the mkotila/aip-725-implement-relevant-metrics-involving-input-image-token-count branch from 3c4d82a to c95a0d4 on February 12, 2026 at 22:30
@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/metrics_reference.md (1)

459-461: ⚠️ Potential issue | 🟡 Minor

Section note may mislead users about the new image-token metrics.

The existing note says these metrics require "image-capable endpoints (e.g., image generation APIs)." However, the six new image-token metrics (lines 494–594) are for image input scenarios (vision-language models receiving images as input), not image generation. The existing Image Throughput/Latency metrics are for generation endpoints, so the note is only accurate for those two.

Consider splitting the section or updating the note to clarify that the image-token metrics apply to endpoints that accept images as input (e.g., VLMs), while Image Throughput/Latency apply to image-generation endpoints.

🤖 Fix all issues with AI agents
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 216-230: The info log about hybrid token counting is triggered
only when user_config.endpoint.use_server_token_count is true, but it should run
when the code silently augments client-side tokenization (i.e., when
use_server_token_count is false) so users requesting client-side counts see the
one-time notice; update the conditional in the method that contains
_endpoint_supports_text_tokenization and _compute_client_side_token_counts so
the info() message is logged when has_images and
self._endpoint_supports_text_tokenization and not self._image_hybrid_info_logged
and NOT self.user_config.endpoint.use_server_token_count (and still set
self._image_hybrid_info_logged = True), and adjust the log text to clearly state
that client-side tokenization is being supplemented by server-reported totals
for output/reasoning while image token counts are derived client-side; keep the
existing branch for the explicit use_server_token_count case unchanged or
provide a distinct message if desired.
🧹 Nitpick comments (1)
tests/unit/metrics/test_image_token_metrics.py (1)

171-178: Test names in remaining classes don't follow the mandated naming convention.

TestTextInputTokenCountMetric, TestTokensPerImageMetric, TestImageTokenRatioMetric, TestTotalImageInputTokensMetric, TestImageInputTokenThroughputMetric, and TestImageTokenMetricsIntegration contain test methods that don't follow the test_<function>_<scenario>_<expected> pattern. For example, test_reads_text_input_from_token_counts should be something like test_text_input_token_count_metric_reads_text_input_returns_count.

The naming was fixed for TestImageInputTokenCountMetric but the same pattern wasn't applied to the other test classes.

As per coding guidelines: "Name tests as test_<function>_<scenario>_<expected> (e.g., test_parse_config_missing_field_raises_error)."

Comment on lines +216 to +230
if has_images and self._endpoint_supports_text_tokenization:
    if (
        self.user_config.endpoint.use_server_token_count
        and not self._image_hybrid_info_logged
    ):
        self.info(
            "Non-text input modalities detected (images). Input token counting "
            "will be supplemented with client-side text tokenization to derive "
            "image token counts. Output and reasoning token counts will continue "
            "to use server-reported values."
        )
        self._image_hybrid_info_logged = True
    token_counts = await self._compute_client_side_token_counts(
        request_record, resp
    )

⚠️ Potential issue | 🟡 Minor

Info log fires in the wrong case; the automatic hybrid switch is silent.

The one-time info message (lines 221–227) only triggers when use_server_token_count is already True. But that's the case where the user explicitly asked for server counts — the hybrid augmentation is a minor tweak to their expectation.

The more important case is when use_server_token_count is False (the default): the user expects pure client-side tokenization, but the code silently switches ISL to the server-reported total. A one-time info log here would help users understand why their ISL values differ from pure client tokenization.

Proposed fix
         if has_images and self._endpoint_supports_text_tokenization:
             if (
                 self.user_config.endpoint.use_server_token_count
                 and not self._image_hybrid_info_logged
             ):
                 self.info(
                     "Non-text input modalities detected (images). Input token counting "
                     "will be supplemented with client-side text tokenization to derive "
                     "image token counts. Output and reasoning token counts will continue "
                     "to use server-reported values."
                 )
                 self._image_hybrid_info_logged = True
+            elif (
+                not self.user_config.endpoint.use_server_token_count
+                and not self._image_hybrid_info_logged
+            ):
+                self.info(
+                    "Non-text input modalities detected (images). ISL will use "
+                    "server-reported usage.prompt_tokens instead of client-side "
+                    "tokenization to include image tokens. Per-modality breakdowns "
+                    "are available via image_input_token_count and text_input_token_count."
+                )
+                self._image_hybrid_info_logged = True
             token_counts = await self._compute_client_side_token_counts(
                 request_record, resp
             )

@matthewkotila (Contributor, Author) replied:

There is a pending discussion arguing in favor of always using server token counts.

Creating this thread as a hold for the PR.
