docs: Add docs for synthetic dataset #687

Open

lkomali wants to merge 5 commits into main from lkomali/clarify-synthetic-datasets

Conversation

@lkomali (Contributor) commented Feb 18, 2026

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive tutorial for synthetic dataset generation across text, image, audio, and video modalities with workflows, configuration examples, dataset loader guidance, and CLI usage.
    • Updated vision endpoint docs with improved setup and readiness probing, profiling notes, minor structure tweaks, and copyright year update.
  • Chores
    • Exposed media tooling paths in the build environment so image/audio/video tooling is available during builds.

github-actions bot commented Feb 18, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@a4449ec5b350b6eb31b91a877a2e14537f1a4cfd

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@a4449ec5b350b6eb31b91a877a2e14537f1a4cfd

Last updated for commit: a4449ec

coderabbitai bot commented Feb 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.
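
For reference, a sketch of where that setting might live (the `.coderabbit.yaml` file name and the threshold value of 10 are assumptions; only the setting path `reviews.auto_review.auto_pause_after_reviewed_commits` comes from the note above):

```bash
# Hypothetical sketch: append the pause threshold to the repository's CodeRabbit config.
# The file name and the value 10 are assumptions; the setting path is from the note above.
cat <<'EOF' >> .coderabbit.yaml
reviews:
  auto_review:
    auto_pause_after_reviewed_commits: 10
EOF
```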

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉


Walkthrough

Adds a new synthetic-dataset tutorial covering text, image, and audio generation (video generation is pointed to a separate file), updates the vision tutorial (year, readiness probe loop, HTML-commented profiling/setup sections), and exports FFmpeg paths in the Dockerfile env-builder stage.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Synthetic dataset tutorial (`docs/tutorials/synthetic-dataset.md`) | New tutorial describing modality-specific generators, configuration options, deterministic sampling and reproducibility notes, workflows for text/image/audio, CLI examples, a random-pool loader example, and guidance pointing video to a separate file. |
| Vision tutorial edits (`docs/tutorials/vision.md`) | SPDX year updated; endpoint setup, readiness check, and profiling/usage sections wrapped in HTML comments; replaced the single-shot curl check with looped polling until HTTP 200; minor spacing and structural adjustments. |
| Build environment (`Dockerfile`) | Adds `ENV PATH="/opt/ffmpeg/bin:${PATH}"` and `ENV LD_LIBRARY_PATH="/opt/ffmpeg/lib"` in the env-builder stage to expose FFmpeg binaries and libraries to subsequent stages. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I nibble prompts and paint in bytes,
I hum a tone and stitch the lights,
I loop until the service sings,
I tuck FFmpeg under wings,
Hop—docs and builds take flight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately summarizes the main change: adding documentation for synthetic dataset generation. The title is concise, clear, and directly reflects the primary addition of docs/tutorials/synthetic-dataset.md. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/tutorials/synthetic-dataset.md`:
- Around line 93-106: The image example code block for the aiperf profile
invocation is missing the CI test markers used elsewhere; update the
synthetic-dataset.md image section by adding the appropriate <!-- aiperf-run-...
--> marker pair around the image bash block to match the text and audio examples
(or, if images are tested differently, add a short comment immediately above the
block explaining the alternate testing mechanism). Locate the image block (the
bash snippet invoking aiperf with --model Qwen/Qwen2-VL-2B-Instruct and image-*
flags) and either wrap it with the same marker format used for the other
sections or add the explanatory comment so the documentation and CI behavior are
consistent.
- Around line 40-51: The two aiperf code-block tags are identical; update the
prefix-synthesis block so its HTML comment markers use a distinct tag: change
both the opening comment <!-- aiperf-run-vllm-default-openai-endpoint-server -->
and the closing <!-- /aiperf-run-vllm-default-openai-endpoint-server --> that
surround the prefix example to <!--
aiperf-run-vllm-prefix-openai-endpoint-server --> and <!--
/aiperf-run-vllm-prefix-openai-endpoint-server --> respectively so the opening
and closing markers match the new tag name.

codecov bot commented Feb 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@debermudez (Contributor) left a comment:

Looks solid to me

@ajcasagrande (Contributor) left a comment:

The content looks great, but I think there are potentially some gaps.

Comment on lines 195 to 228
---

## Random Pool Dataset Loader

The `random_pool` dataset type allows custom data with random sampling:

### Use Cases

1. **Single File**: All entries form one pool, randomly sampled for each request
2. **Directory**: Each file becomes a separate pool, merged into conversations

### Single File Example

Create `prompts.jsonl`:
```jsonl
{"text": "Who are you?", "image": "/path/to/image1.png"}
{"text": "Explain quantum mechanics.", "image": "/path/to/image2.png"}
{"text": "What is the capital of France?", "image": "/path/to/image3.png"}
```

Run benchmark:
```bash
aiperf profile \
--model your-model \
--url localhost:8000 \
--endpoint-type chat \
--input-file prompts.jsonl \
--custom-dataset-type random_pool \
--request-count 100
```

AIPerf randomly samples from the 3 entries to generate 100 requests (with replacement).

---
Contributor comment:

this seems out of place, as random pool is not synthetic

Comment on lines 53 to 56
**Options:**
- `--synthetic-input-tokens-mean`: Mean token count (default: 550)
- `--synthetic-input-tokens-stddev`: Standard deviation for length variability (default: 0)
- `--random-seed`: Seed for reproducible prompt generation (default: None)
Contributor comment:

missing info about the sequence length distribution, which is a big part of text gen
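
For context, a minimal sketch of how the options quoted above might be combined in a run (the model name, URL, and the stddev/seed values are illustrative placeholders, not from the tutorial):

```bash
# Illustrative only: model, URL, stddev, and seed values are placeholders.
aiperf profile \
  --model your-model \
  --url localhost:8000 \
  --endpoint-type chat \
  --synthetic-input-tokens-mean 550 \
  --synthetic-input-tokens-stddev 100 \
  --random-seed 42 \
  --request-count 100
```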

SPDX-License-Identifier: Apache-2.0
-->

# Synthetic Dataset Generation
Contributor comment:

Should we include details about text generation from hash_ids in trace datasets? Those generate text using the Shakespeare corpus as well.

Reply from @lkomali (Contributor Author):

I think the right place to add that is trace_replay.md instead of here, since we have a dedicated document for it.
Let me know your thoughts on it.

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
docs/tutorials/vision.md (1)

28-32: Consider using /v1/models for the readiness probe instead of /v1/chat/completions.

The current probe fires an actual inference POST (/v1/chat/completions) on every 2-second poll during the 15-minute startup window, which wastes GPU resources. /health only confirms the server process is running, not that the model is loaded. The idiomatic lightweight readiness check for vLLM that verifies the model is loaded is a GET against /v1/models.

✏️ Suggested lighter readiness probe
-<!-- health-check-vllm-vision-openai-endpoint-server -->
-```bash
-timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen2-VL-2B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
-```
-<!-- /health-check-vllm-vision-openai-endpoint-server -->
+<!-- health-check-vllm-vision-openai-endpoint-server -->
+```bash
+timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/models)" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
+```
+<!-- /health-check-vllm-vision-openai-endpoint-server -->
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/tutorials/vision.md` around lines 28 - 32, The readiness probe currently
posts to /v1/chat/completions (causing expensive inferences); change it to
perform a lightweight GET against /v1/models instead. In the
docs/tutorials/vision.md snippet that defines the health-check block (<!--
health-check-vllm-vision-openai-endpoint-server -->) replace the loop curl
target from POSTing to /v1/chat/completions to a simple GET to /v1/models so the
loop checks status code 200 without triggering model inference.
docs/tutorials/synthetic-dataset.md (1)

27-27: Prefer #### subsection headings under ## per repo convention.

All ### How It Works / ### Configuration headings throughout this file (lines 27, 38, 61, 84, 94, 127, 143) use h3 under h2. Based on learnings, the maintainer prefers h4 (####) for subsections under h2 headings for better visual sizing.

✏️ Example fix (apply to all ### subsections)
-### How It Works
+#### How It Works

-### Configuration
+#### Configuration

-### Advanced: Prefix Synthesis
+#### Advanced: Prefix Synthesis

Based on learnings: "In the aiperf repository's docs/metrics_reference.md file, the maintainer prefers using h4 headings (####) for subsections under h2 headings instead of h3 (###) for better visual sizing and readability."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/tutorials/synthetic-dataset.md` at line 27, Update all subsection
headings that currently use "###" under top-level "##" sections to use "####"
instead for consistent visual sizing; specifically replace headings like "###
How It Works" and "### Configuration" (and the other ### subsections at the
listed locations) with "#### How It Works" and "#### Configuration" throughout
the document so every h3 under an h2 becomes an h4, ensuring consistent styling
across the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/tutorials/synthetic-dataset.md`:
- Around line 184-201: The CI marker currently bundles a privileged system
install inside the AIPerf execution block; move the `sudo apt-get install -y
ffmpeg` line out of the `<!-- aiperf-run-vllm-vision-openai-endpoint-server -->`
block and place it in a new or existing setup marker (e.g., `<!--
setup-vllm-vision-video-openai-endpoint-server -->`) before the aiperf block so
CI runs the setup step separately from the benchmark command, keeping the aiperf
block strictly for the `aiperf profile ...` command.

---

Nitpick comments:
In `@docs/tutorials/synthetic-dataset.md`:
- Line 27: Update all subsection headings that currently use "###" under
top-level "##" sections to use "####" instead for consistent visual sizing;
specifically replace headings like "### How It Works" and "### Configuration"
(and the other ### subsections at the listed locations) with "#### How It Works"
and "#### Configuration" throughout the document so every h3 under an h2 becomes
an h4, ensuring consistent styling across the file.

In `@docs/tutorials/vision.md`:
- Around line 28-32: The readiness probe currently posts to /v1/chat/completions
(causing expensive inferences); change it to perform a lightweight GET against
/v1/models instead. In the docs/tutorials/vision.md snippet that defines the
health-check block (<!-- health-check-vllm-vision-openai-endpoint-server -->)
replace the loop curl target from POSTing to /v1/chat/completions to a simple
GET to /v1/models so the loop checks status code 200 without triggering model
inference.

@lkomali lkomali force-pushed the lkomali/clarify-synthetic-datasets branch from a0751b3 to 5e6fd89 on February 19, 2026 at 18:21
coderabbitai bot left a comment

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/tutorials/synthetic-dataset.md`:
- Around line 184-201: The apt-get installation command is currently inside the
aiperf-run marker block; move that system-level prerequisite into its own setup
marker. Create a separate fenced bash block wrapped with <!--
setup-vllm-vision-video-openai-endpoint-server --> ... <!--
/setup-vllm-vision-video-openai-endpoint-server --> containing "apt-get update
&& apt-get install -y ffmpeg" and remove that line from the <!--
aiperf-run-vllm-vision-openai-endpoint-server --> block so the aiperf profile
invocation (the aiperf profile command) remains only inside the aiperf-run
marker.
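
For illustration, the split described above might look like the following two blocks; the HTML comment markers named in the prompt would wrap each block, and the aiperf flags shown here are placeholders reused from the text example earlier in this thread (the video-specific flags are omitted because they are not quoted in this review):

Setup step (wrapped by the `setup-vllm-vision-video-openai-endpoint-server` markers):

```bash
# System prerequisite installed separately from the benchmark run
apt-get update && apt-get install -y ffmpeg
```

Benchmark step (kept inside the `aiperf-run-vllm-vision-openai-endpoint-server` markers):

```bash
# Sketch only: endpoint flags are placeholders; modality-specific flags omitted.
aiperf profile \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --url localhost:8000 \
  --endpoint-type chat \
  --request-count 100
```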

@lkomali lkomali requested a review from ajcasagrande February 19, 2026 19:07
@lkomali lkomali force-pushed the lkomali/clarify-synthetic-datasets branch from d13e4d0 to 4639e88 on February 19, 2026 at 19:28
coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/tutorials/vision.md (1)

86-111: ⚠️ Potential issue | 🟡 Minor

Duplicate HTML comment marker — copy-paste error

The marker aiperf-run-vllm-vision-openai-endpoint-server is used three times in the file: lines 40–53 (synthetic image profiling), lines 86–96 (custom JSONL creation), and lines 100–111 (run with custom input). Since each section serves a different purpose, the markers should be uniquely named to avoid confusion and prevent issues if these markers are ever consumed by CI/automation pipelines.

✏️ Proposed fix

-<!-- aiperf-run-vllm-vision-openai-endpoint-server -->
+<!-- create-custom-input-file-vllm-vision-openai-endpoint-server -->
 cat <<EOF > inputs.jsonl
 ...
 EOF

The closing marker after the inputs.jsonl block, and the open/close marker pair around the "Run AIPerf using the custom input file" section (the aiperf profile block), are renamed in the same way so that every marker pair in the file is unique.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @docs/tutorials/vision.md around lines 86 - 111, The three HTML comment
markers named aiperf-run-vllm-vision-openai-endpoint-server are duplicated;
rename the closing marker after the custom JSONL block and the open/close
markers around the "Run AIPerf using the custom input file" block to unique
names so each section is distinct (e.g., change the closing marker after the
inputs.jsonl here and change the surrounding open/close markers for the run
block), updating the matching start/end comment pairs so the markers are unique
and paired correctly (refer to the markers
aiperf-run-vllm-vision-openai-endpoint-server, the closing comment after the
inputs.jsonl block, and the open/close comments around the aiperf profile run).




🧹 Nitpick comments (1)
docs/tutorials/vision.md (1)

28-32: Consider polling /v1/models instead of /v1/chat/completions for the readiness check

vLLM's /health endpoint only indicates that the server process is running, not that the model is loaded and ready to serve. The current approach polls /v1/chat/completions with max_tokens:1, which does confirm the model is fully loaded, but it also triggers a real inference request on every 2-second poll iteration for up to 15 minutes — wasteful for a large VLM startup. /v1/models is the standard model-aware readiness endpoint recommended for startup probes, returning 200 only once the model is loaded.

✏️ Suggested alternative
-timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen2-VL-2B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
+timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/models)" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/tutorials/vision.md` around lines 28 - 32, The readiness check currently
polls the heavy inference endpoint '/v1/chat/completions' (see the curl command
in the diff), causing repeated real inferences; change the probe to query the
model-aware endpoint '/v1/models' instead so it returns 200 only when the model
is loaded; update the bash timeout loop to curl localhost:8000/v1/models (no
request body) and retain the same timeout/sleep behavior and error handling
message so startup waits for model readiness without triggering inference.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Dockerfile`:
- Around line 119-121: The Dockerfile's ENV line sets LD_LIBRARY_PATH with a
trailing colon by expanding a possibly-empty ${LD_LIBRARY_PATH}, which appends
the CWD to the linker search path; update the ENV declaration that sets PATH and
LD_LIBRARY_PATH so LD_LIBRARY_PATH is assigned directly to "/opt/ffmpeg/lib" (no
${LD_LIBRARY_PATH} expansion or trailing colon) and apply the same change in the
runtime stage where LD_LIBRARY_PATH is set (the ENV entries for PATH and
LD_LIBRARY_PATH in the build and runtime stages) to avoid unintentionally adding
the current working directory to the dynamic loader search path.
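
To illustrate the concern in plain shell terms (a sketch of the expansion behaviour, not the Dockerfile itself):

```bash
# With LD_LIBRARY_PATH unset, expanding it leaves a trailing colon, and the
# dynamic loader treats the resulting empty entry as the current directory.
unset LD_LIBRARY_PATH
LD_LIBRARY_PATH="/opt/ffmpeg/lib:${LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"    # prints "/opt/ffmpeg/lib:"

# Assigning the value directly avoids the empty entry:
LD_LIBRARY_PATH="/opt/ffmpeg/lib"
echo "$LD_LIBRARY_PATH"    # prints "/opt/ffmpeg/lib"
```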

---

Outside diff comments:
In `@docs/tutorials/vision.md`:
- Around line 86-111: The three HTML comment markers named
aiperf-run-vllm-vision-openai-endpoint-server are duplicated; rename the closing
marker after the custom JSONL block and the open/close markers around the "Run
AIPerf using the custom input file" block to unique names so each section is
distinct (e.g., change the closing marker after the inputs.jsonl here and change
the surrounding open/close markers for the run block), updating the matching
start/end comment pairs so the markers are unique and paired correctly (refer to
the markers aiperf-run-vllm-vision-openai-endpoint-server, the closing comment
after the inputs.jsonl block, and the open/close comments around the aiperf
profile run).

---

Nitpick comments:
In `@docs/tutorials/vision.md`:
- Around line 28-32: The readiness check currently polls the heavy inference
endpoint '/v1/chat/completions' (see the curl command in the diff), causing
repeated real inferences; change the probe to query the model-aware endpoint
'/v1/models' instead so it returns 200 only when the model is loaded; update the
bash timeout loop to curl localhost:8000/v1/models (no request body) and retain
the same timeout/sleep behavior and error handling message so startup waits for
model readiness without triggering inference.
