
feat: Implement relevant metrics involving input image token count #667

Open

matthewkotila wants to merge 2 commits into main from
mkotila/aip-725-implement-relevant-metrics-involving-input-image-token-count

Conversation

@matthewkotila (Contributor) commented Feb 11, 2026

  • When images are in the input, aiperf now counts image tokens separately from text tokens
    • It asks the server "how many total input tokens?" and the local tokenizer "how many text tokens?", then subtracts to get the image token count (see the sketch after this list)
  • Six new metrics for image token analysis:
    • per-request image and text input token counts (two metrics; default file-only)
    • tokens per image (default file-only)
    • image-to-total token ratio (default file-only)
    • total image tokens across the benchmark (default file-only)
    • image input token throughput (tokens/sec) (default CONSOLE)
  • The behavior behind --use-server-token-count is applied automatically when images are detected — no user action needed
    • aiperf detects images and switches to the hybrid counting approach on its own.
  • New fields on TokenCounts: text_input and image_input (only populated when images are present; None otherwise).
  • Wording is future-proofed for other non-text modalities (e.g., audio, video) so the logic and docs won't need rewriting when those are added.
  • 45+ new unit tests covering the hybrid token counting, all six metrics, edge cases (clamping, missing usage, streaming), and integration scenarios.
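
A minimal sketch of the subtraction described above, with the clamping behavior mentioned in the review below (the function and variable names are illustrative, not the actual aiperf implementation):

```python
def derive_image_input_tokens(server_prompt_tokens: int, client_text_tokens: int) -> int:
    """Image input tokens = server-reported prompt total minus client-tokenized text tokens."""
    image_tokens = server_prompt_tokens - client_text_tokens
    if image_tokens < 0:
        # The server reported fewer prompt tokens than the client counted for text alone;
        # clamp to zero rather than reporting a negative image token count.
        image_tokens = 0
    return image_tokens


# Example: the server reports 1200 total prompt tokens and the local tokenizer counts
# 450 text tokens, so 750 tokens are attributed to the image content.
assert derive_image_input_tokens(1200, 450) == 750
```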

Summary by CodeRabbit

  • New Features

    • Added image token metrics: image input tokens, text input tokens, tokens per image, image token ratio, total image input tokens, and image input token throughput.
    • Parser now supports hybrid image-aware token counting and reports a single informational message when that mode is used.
  • Documentation

    • Clarified multimodal token-counting across CLI, metrics reference, and metric descriptions; explains server vs. client token sources and ISL behavior with images.
  • Tests

    • Added unit tests for image token metrics and image-aware parsing behaviors.

@github-actions bot commented Feb 11, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c95a0d45483c60dac56514631b6e373015a8aa2f

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c95a0d45483c60dac56514631b6e373015a8aa2f

Last updated for commit: c95a0d4

@codecov bot commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 98.26087% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/aiperf/metrics/types/image_token_metrics.py | 97.56% | 1 Missing and 1 partial ⚠️ |


@coderabbitai bot commented Feb 11, 2026

Walkthrough

Adds client/server hybrid token counting for multimodal requests (images), introduces per-record text_input and image_input fields, six new image-token metrics (counts, ratios, throughput), updates docs/config, and adds comprehensive unit tests for metrics and parser behavior.

Changes

Cohort / File(s) Summary
Documentation
docs/cli_options.md, docs/metrics_reference.md
Notes and TOC updates describing hybrid token-counting for non-text modalities; added six image-token metrics and clarified ISL and throughput semantics for multimodal input.
Config & Model docs
src/aiperf/common/config/endpoint_config.py, src/aiperf/common/models/record_models.py
Expanded use_server_token_count description; added `text_input: int | None` and `image_input: int | None` fields to TokenCounts (populated only when images are present).
Metrics implementation
src/aiperf/metrics/types/image_token_metrics.py, src/aiperf/metrics/types/input_sequence_length_metric.py, src/aiperf/metrics/types/total_token_throughput.py
New module with six image-token metrics (ImageInputTokenCount, TextInputTokenCount, TokensPerImage, ImageTokenRatio, TotalImageInputTokens, ImageInputTokenThroughput); updated ISL and throughput docstrings to reference multimodal counts. (A formula sketch follows this table.)
Parser logic
src/aiperf/records/inference_result_parser.py
Added image detection, endpoint capability flag, single-info-log guard, and hybrid counting: when images present and endpoint supports text tokenization, use server usage.prompt_tokens as total and derive image_input = prompt_tokens - client_text_tokens (clamped); includes warnings and fallbacks for missing server usage.
Tests
tests/unit/metrics/test_image_token_metrics.py, tests/unit/records/test_inference_result_parser_image_tokens.py
Comprehensive unit tests for per-record image/text token metrics, derived sums/ratios/throughput, parser hybrid behavior, edge cases (missing server usage, zero images/duration), multi-turn and streaming scenarios.
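
The derived metrics in the "Metrics implementation" row above reduce to simple per-benchmark formulas. A hedged sketch (the helper below and its key names are illustrative; the real implementation lives in the metric classes in image_token_metrics.py and aggregates per-record values):

```python
def image_token_metrics(
    image_tokens: int, text_tokens: int, num_images: int, benchmark_duration_s: float
) -> dict[str, float]:
    """Illustrative roll-ups for the new image-token metrics."""
    total_input = image_tokens + text_tokens
    return {
        # Average number of input tokens contributed by each image.
        "tokens_per_image": image_tokens / num_images if num_images else 0.0,
        # Fraction of all input tokens that came from images.
        "image_token_ratio": image_tokens / total_input if total_input else 0.0,
        # Image input tokens processed per second over the whole benchmark.
        "image_input_token_throughput": image_tokens / benchmark_duration_s,
    }


# Example: 750 image tokens and 450 text tokens across 3 images in a 10-second run.
print(image_token_metrics(750, 450, 3, 10.0))
# {'tokens_per_image': 250.0, 'image_token_ratio': 0.625, 'image_input_token_throughput': 75.0}
```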

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble tokens, count with care,
Text or image — split them fair.
Six new metrics, parser bright,
Tests ensure the math is right.
Hop on, report, and out of sight!

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main feature: implementing metrics for image token counting, which is the primary focus across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.68%, which is sufficient. The required threshold is 80.00%. |
| Merge Conflict Detection | ✅ Passed | No merge conflicts detected when merging into main. |


@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/aiperf/records/inference_result_parser.py (1)

386-435: ⚠️ Potential issue | 🟠 Major

Fix hybrid counting for image-only prompts and honor server output counts when requested.

Two edge cases here:

  • For image-only requests, text_input_token_count becomes None, so server prompt tokens are ignored and a misleading “missing usage” warning is logged. This drops image-token metrics even when usage.prompt_tokens exists.
  • When --use-server-token-count is enabled, this path still uses client-side output/reasoning tokenization, which breaks the flag’s semantics and can skew metrics for users relying on server counts.

Consider treating missing text as 0 when images are present and preferring server output/reasoning counts when the flag is set (fallback to client if server usage is missing).

🐛 Suggested fix for hybrid counting edge cases
         output_texts, reasoning_texts = self._parse_output_and_reasoning_texts(
             responses
         )
         output_token_count = self._compute_token_count(tokenizer, output_texts)
         reasoning_token_count = self._compute_token_count(tokenizer, reasoning_texts)
+        if self.user_config.endpoint.use_server_token_count:
+            server_reasoning = self._extract_server_reasoning_token_count(responses)
+            server_output = self._extract_server_output_token_count(
+                responses, server_reasoning
+            )
+            if server_reasoning is not None:
+                reasoning_token_count = server_reasoning
+            if server_output is not None:
+                output_token_count = server_output
 
         if self._request_has_images(request_record):
             server_prompt_tokens = self._extract_server_input_token_count(responses)
-            if server_prompt_tokens is not None and text_input_token_count is not None:
-                image_input_token_count = server_prompt_tokens - text_input_token_count
+            if server_prompt_tokens is not None:
+                effective_text_input = text_input_token_count or 0
+                image_input_token_count = server_prompt_tokens - effective_text_input
                 if image_input_token_count < 0:
                     self.warning(
                         f"Server reported fewer prompt tokens ({server_prompt_tokens}) than "
                         f"client-side text token count ({text_input_token_count}). "
                         "Clamping image input tokens to 0."
                     )
                     image_input_token_count = 0
                 return TokenCounts(
                     input=server_prompt_tokens,
-                    text_input=text_input_token_count,
+                    text_input=effective_text_input,
                     image_input=image_input_token_count,
                     reasoning=reasoning_token_count,
                     output=output_token_count,
                 )
-            else:
-                self.warning(
-                    "Images detected in input but server did not report usage.prompt_tokens. "
-                    "ISL will reflect text tokens only; image token count is unknown."
-                )
+            self.warning(
+                "Images detected in input but server did not report usage.prompt_tokens. "
+                "ISL will reflect text tokens only; image token count is unknown."
+            )
🤖 Fix all issues with AI agents
In `@tests/unit/metrics/test_image_token_metrics.py`:
- Around line 98-165: Rename the test methods in class
TestImageInputTokenCountMetric to follow test_<function>_<scenario>_<expected>;
e.g., rename test_reads_image_input_from_token_counts to
test_image_input_token_count_metric_reads_image_input_tokens_returns_count,
test_various_image_token_counts to
test_image_input_token_count_metric_various_counts_return_same,
test_multiple_records to
test_image_input_token_count_metric_multiple_records_preserves_order,
test_no_image_input_raises to
test_image_input_token_count_metric_no_image_input_raises_NoMetricValue,
test_no_token_counts_raises to
test_image_input_token_count_metric_no_token_counts_raises_NoMetricValue, and
test_zero_image_tokens to
test_image_input_token_count_metric_zero_image_tokens_returns_zero so the names
reference ImageInputTokenCountMetric / run_image_token_pipeline and match the
mandated pattern.

In `@tests/unit/records/test_inference_result_parser_image_tokens.py`:
- Around line 94-151: The test methods in TestRequestHasImages do not follow the
required naming convention test_<function>_<scenario>_<expected>; rename each
async test method to reflect the function _request_has_images, the scenario, and
the expected result (e.g., test_request_has_images_detects_images_true for
test_detects_images, test_request_has_images_no_images_false for test_no_images,
test_request_has_images_empty_image_contents_false,
test_request_has_images_no_turns_false, and
test_request_has_images_multiple_turns_with_images_true) so the class
TestRequestHasImages and its methods (e.g., test_detects_images, test_no_images,
test_empty_image_contents, test_no_turns, test_multiple_turns_with_images) match
the repo rule.
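
A hedged illustration of the requested pattern (the assertion bodies are placeholders, not the real tests in the PR; only the naming is the point):

```python
# Before: the scenario and the expected outcome are not encoded in the name.
def test_zero_image_tokens():
    assert max(1200 - 1200, 0) == 0


# After: test_<function>_<scenario>_<expected>, naming the metric under test.
def test_image_input_token_count_metric_zero_image_tokens_returns_zero():
    assert max(1200 - 1200, 0) == 0
```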

@matthewkotila force-pushed the mkotila/aip-725-implement-relevant-metrics-involving-input-image-token-count branch from 206333d to 88c430e on February 11, 2026 at 17:59
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 407-430: The current else lumps two failure modes together; change
the logic in inference_result_parser.py around _request_has_images(),
_extract_server_input_token_count(), and TokenCounts so you handle them
separately: (1) if server_prompt_tokens is not None but text_input_token_count
is None, log a specific warning that client tokenization failed and return a
partial TokenCounts with input set to server_prompt_tokens and other per-field
values set to sensible partial/unknown values (e.g., text_input=0 or None and
image_input left unknown/None) so callers get the server-reported total; (2) if
server_prompt_tokens is None (regardless of text tokenization), keep the
existing warning that server did not report usage.prompt_tokens and do not
return server-derived totals. Use the existing symbols
(_extract_server_input_token_count, text_input_token_count, TokenCounts, and
self.warning) to find where to implement the split.
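
A standalone sketch of the two-way split described above (the helper and its return labels are illustrative, not the parser's actual API):

```python
def classify_prompt_token_case(
    server_prompt_tokens: int | None, client_text_tokens: int | None
) -> str:
    """Decide how to report input tokens when images are present."""
    if server_prompt_tokens is not None and client_text_tokens is None:
        # Case 1: client-side text tokenization failed; still trust the server total,
        # but leave the image/text breakdown unknown.
        return "use_server_total_without_breakdown"
    if server_prompt_tokens is None:
        # Case 2: the server did not report usage.prompt_tokens; keep the existing
        # warning and fall back to text-only ISL with an unknown image token count.
        return "fall_back_to_client_text_only"
    # Normal hybrid path: derive image tokens as server total minus client text tokens.
    return "hybrid_breakdown_available"


assert classify_prompt_token_case(1200, None) == "use_server_total_without_breakdown"
assert classify_prompt_token_case(None, 450) == "fall_back_to_client_text_only"
assert classify_prompt_token_case(1200, 450) == "hybrid_breakdown_available"
```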
🧹 Nitpick comments (1)
src/aiperf/metrics/types/image_token_metrics.py (1)

132-165: LARGER_IS_BETTER on ImageTokenRatioMetric — intentional?

A higher image-to-total token ratio isn't inherently "better" — it's a descriptive metric indicating composition. This flag typically guides optimization direction in dashboards. Worth confirming this is the intended display behavior, though it has no functional impact.

@matthewkotila force-pushed the mkotila/aip-725-implement-relevant-metrics-involving-input-image-token-count branch from 88c430e to 3c4d82a on February 12, 2026 at 02:12
@matthewkotila force-pushed the mkotila/aip-725-implement-relevant-metrics-involving-input-image-token-count branch from 3c4d82a to c95a0d4 on February 12, 2026 at 22:30
@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/metrics_reference.md (1)

459-461: ⚠️ Potential issue | 🟡 Minor

Section note may mislead users about the new image-token metrics.

The existing note says these metrics require "image-capable endpoints (e.g., image generation APIs)." However, the six new image-token metrics (lines 494–594) are for image input scenarios (vision-language models receiving images as input), not image generation. The existing Image Throughput/Latency metrics are for generation endpoints, so the note is only accurate for those two.

Consider splitting the section or updating the note to clarify that the image-token metrics apply to endpoints that accept images as input (e.g., VLMs), while Image Throughput/Latency apply to image-generation endpoints.

🤖 Fix all issues with AI agents
In `@src/aiperf/records/inference_result_parser.py`:
- Around line 216-230: The info log about hybrid token counting is triggered
only when user_config.endpoint.use_server_token_count is true, but it should run
when the code silently augments client-side tokenization (i.e., when
use_server_token_count is false) so users requesting client-side counts see the
one-time notice; update the conditional in the method that contains
_endpoint_supports_text_tokenization and _compute_client_side_token_counts so
the info() message is logged when has_images and
self._endpoint_supports_text_tokenization and not self._image_hybrid_info_logged
and NOT self.user_config.endpoint.use_server_token_count (and still set
self._image_hybrid_info_logged = True), and adjust the log text to clearly state
that client-side tokenization is being supplemented by server-reported totals
for output/reasoning while image token counts are derived client-side; keep the
existing branch for the explicit use_server_token_count case unchanged or
provide a distinct message if desired.
🧹 Nitpick comments (1)
tests/unit/metrics/test_image_token_metrics.py (1)

171-178: Test names in remaining classes don't follow the mandated naming convention.

TestTextInputTokenCountMetric, TestTokensPerImageMetric, TestImageTokenRatioMetric, TestTotalImageInputTokensMetric, TestImageInputTokenThroughputMetric, and TestImageTokenMetricsIntegration contain test methods that don't follow the test_<function>_<scenario>_<expected> pattern. For example, test_reads_text_input_from_token_counts should be something like test_text_input_token_count_metric_reads_text_input_returns_count.

The naming was fixed for TestImageInputTokenCountMetric but the same pattern wasn't applied to the other test classes.

As per coding guidelines: "Name tests as test_<function>_<scenario>_<expected> (e.g., test_parse_config_missing_field_raises_error)."

Comment on lines +216 to +230
if has_images and self._endpoint_supports_text_tokenization:
    if (
        self.user_config.endpoint.use_server_token_count
        and not self._image_hybrid_info_logged
    ):
        self.info(
            "Non-text input modalities detected (images). Input token counting "
            "will be supplemented with client-side text tokenization to derive "
            "image token counts. Output and reasoning token counts will continue "
            "to use server-reported values."
        )
        self._image_hybrid_info_logged = True
    token_counts = await self._compute_client_side_token_counts(
        request_record, resp
    )

⚠️ Potential issue | 🟡 Minor

Info log fires in the wrong case; the automatic hybrid switch is silent.

The one-time info message (lines 221–227) only triggers when use_server_token_count is already True. But that's the case where the user explicitly asked for server counts — the hybrid augmentation is a minor tweak to their expectation.

The more important case is when use_server_token_count is False (the default): the user expects pure client-side tokenization, but the code silently switches ISL to the server-reported total. A one-time info log here would help users understand why their ISL values differ from pure client tokenization.

Proposed fix
         if has_images and self._endpoint_supports_text_tokenization:
             if (
                 self.user_config.endpoint.use_server_token_count
                 and not self._image_hybrid_info_logged
             ):
                 self.info(
                     "Non-text input modalities detected (images). Input token counting "
                     "will be supplemented with client-side text tokenization to derive "
                     "image token counts. Output and reasoning token counts will continue "
                     "to use server-reported values."
                 )
                 self._image_hybrid_info_logged = True
+            elif (
+                not self.user_config.endpoint.use_server_token_count
+                and not self._image_hybrid_info_logged
+            ):
+                self.info(
+                    "Non-text input modalities detected (images). ISL will use "
+                    "server-reported usage.prompt_tokens instead of client-side "
+                    "tokenization to include image tokens. Per-modality breakdowns "
+                    "are available via image_input_token_count and text_input_token_count."
+                )
+                self._image_hybrid_info_logged = True
             token_counts = await self._compute_client_side_token_counts(
                 request_record, resp
             )

@matthewkotila (Contributor, Author) replied:

There is a pending discussion arguing in favor of always using server token counts.

Creating this thread as a hold for the PR.
