
feat: Refactor ShareGPT from BasePublicDatasetLoader to BaseFileLoader #657

Draft

ardecode wants to merge 1 commit into ai-dynamo:main from ardecode:feat/refactor-datasets-628

Conversation


@ardecode ardecode commented Feb 8, 2026

Summary

Closes #628

  • Refactors ShareGPTLoader from BasePublicDatasetLoader to BaseFileLoader, separating download/caching from parsing so that public datasets reuse the same file-based loader pattern as local datasets.
  • Extracts the download/cache logic into DatasetManager._download_public_dataset() (see the sketch below).
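
A rough sketch of the resulting shape, with the dataclass fields, registry entries, and cache location assumed for illustration (the real definitions live in src/aiperf/dataset/dataset_manager.py):

    from dataclasses import dataclass
    from pathlib import Path

    # Assumed cache location; the PR defines AIPERF_DATASET_CACHE_DIR in dataset_manager.py.
    AIPERF_DATASET_CACHE_DIR = Path.home() / ".cache" / "aiperf" / "datasets"

    @dataclass(frozen=True)
    class PublicDatasetSpec:
        # Fields inferred from the PR summary; the actual spec may differ.
        url: str          # remote source to download from
        filename: str     # name of the cached file under AIPERF_DATASET_CACHE_DIR
        loader_type: str  # which file-based loader parses the cached copy

    # Hypothetical registry keyed by dataset tag.
    PUBLIC_DATASETS: dict[str, PublicDatasetSpec] = {
        "sharegpt": PublicDatasetSpec(
            url="https://example.com/sharegpt.json",  # placeholder URL
            filename="sharegpt.json",
            loader_type="sharegpt",
        ),
    }

Under this shape, DatasetManager resolves the spec for a requested tag, awaits _download_public_dataset(tag, spec.url, spec.filename), and hands the cached path to the same file-based loader machinery used for local datasets.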

Summary by CodeRabbit

  • New Features

    • Implemented automatic dataset caching system for faster loading and improved performance
    • Enhanced ShareGPT dataset loader with local file support
  • Deprecations

    • BasePublicDatasetLoader has been deprecated; please migrate to the new loader implementation (see the sketch after this list)
  • Improvements

    • Added validation for dataset sequences to ensure data quality
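
As referenced above, a minimal sketch of what such a runtime deprecation warning typically looks like; the exact message and base-class chaining are assumed, not taken from the PR:

    import warnings

    class BasePublicDatasetLoader(BaseLoader):
        def __init__(self, *args, **kwargs) -> None:
            # Assumed form of the warning emitted during initialization.
            warnings.warn(
                "BasePublicDatasetLoader is deprecated; subclass BaseFileLoader instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            super().__init__(*args, **kwargs)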


github-actions bot commented Feb 8, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e61e2ce0b984992f914ef964961c8df33834b29b

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e61e2ce0b984992f914ef964961c8df33834b29b

Last updated for commit: e61e2ce

@github-actions github-actions bot added the feat label Feb 8, 2026
feat: Refactor ShareGPT from BasePublicDatasetLoader to BaseFileLoader (ai-dynamo#628)

Separate download/caching from parsing so public datasets reuse the same
file-based loader pattern as local datasets. Download logic moves to
DatasetManager, and ShareGPT becomes a regular custom_dataset_loader plugin.
BasePublicDatasetLoader is deprecated.

Signed-off-by: ardecode <desaiarijit@gmail.com>
@ardecode ardecode force-pushed the feat/refactor-datasets-628 branch from 5004194 to e61e2ce on February 8, 2026 at 23:15

coderabbitai bot commented Feb 8, 2026

Walkthrough

The changes refactor public dataset loaders into concrete loader implementations using a spec-driven approach. The ShareGPT loader migrates from async remote download to synchronous local-file operations. A new caching system built on AioHttpClient handles public dataset downloads. BaseFileLoader gains a sequence validation method. BasePublicDatasetLoader is deprecated in favor of direct loader implementations.

Changes

Dataset Manager Refactoring (src/aiperf/dataset/dataset_manager.py)
Added a PublicDatasetSpec dataclass and a PUBLIC_DATASETS mapping for spec-driven public dataset configuration. Introduced the async _download_public_dataset method for caching datasets locally using AioHttpClient. Added the AIPERF_DATASET_CACHE_DIR constant and integrated dynamic loader instantiation from the spec.

Base Loader Enhancements (src/aiperf/dataset/loader/base_loader.py, src/aiperf/dataset/loader/base_public_dataset.py)
Added an is_valid_sequence method to BaseFileLoader for validating prompt-output sequences with configurable length constraints. Added a runtime deprecation warning to BasePublicDatasetLoader, emitted during initialization.

ShareGPT Loader Migration (src/aiperf/dataset/loader/sharegpt.py)
Migrated the base class from BasePublicDatasetLoader to BaseFileLoader, changing from async remote download to synchronous local-file operations. Added a can_load classmethod, renamed get_recommended_sampling_strategy to get_preferred_sampling_strategy, and updated load_dataset and convert_to_conversations to sync methods with revised signatures (see the sketch following this list).

Plugin Configuration and Tests (src/aiperf/plugin/plugins.yaml, tests/unit/dataset/loader/test_sharegpt.py, tests/unit/dataset/test_dataset_manager.py)
Registered the ShareGPT loader in plugins.yaml, updated the ShareGPT tests to use the filename parameter and call the sync methods, and added mock patches for _download_public_dataset in the dataset manager tests.
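
As referenced in the ShareGPT row above, a minimal sketch of the migrated loader's synchronous shape; the method signatures and the self.filename attribute are assumptions for illustration, not the PR's exact code:

    import json
    from pathlib import Path

    from aiperf.dataset.loader.base_loader import BaseFileLoader

    class ShareGPTLoader(BaseFileLoader):  # previously BasePublicDatasetLoader
        def load_dataset(self) -> list[dict]:
            # Plain synchronous read: DatasetManager has already downloaded
            # and cached the file locally before the loader runs.
            return json.loads(Path(self.filename).read_text())

        def convert_to_conversations(self, dataset: list[dict]) -> list:
            # Sketch only: filter entries with the new is_valid_sequence
            # check, then build conversation objects (conversion omitted).
            ...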

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hoppy news! Public datasets now load the loader way,
Caching files with AioHttp, keeping code at bay,
ShareGPT hops from async clouds to local files so neat,
BaseLoaders gain validation—refactoring complete! 🎉

🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)

Docstring Coverage ⚠️ Warning: Docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4)

Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
Title Check ✅ Passed: The title clearly and concisely describes the main refactoring, migrating ShareGPT from BasePublicDatasetLoader to BaseFileLoader, which is the primary change in this PR.
Linked Issues Check ✅ Passed: The PR fully implements the objectives from issue #628: it refactors ShareGPT into a concrete loader implementation, decouples download/caching (moved to DatasetManager._download_public_dataset) from parsing, and treats download as a facade over loaders.
Out of Scope Changes Check ✅ Passed: All changes relate directly to the refactoring objectives (spec-driven loader management, BaseFileLoader enhancements, BasePublicDatasetLoader deprecation, and supporting test updates). No extraneous changes detected.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/aiperf/dataset/dataset_manager.py`:
- Around line 297-320: In _download_public_dataset, the current try block wraps
both the HTTP call (AioHttpClient.get_request) and file write
(cache_filepath.write_text) so failures during download produce a misleading
"Error saving ..." message; split the operations into two try/excepts: first
await http_client.get_request(...) and catch exceptions to raise a
DatasetLoaderError like "Error downloading {tag} dataset: {e}", then perform
cache_filepath.parent.mkdir(...) and cache_filepath.write_text(...) in a
separate try/except that raises "Error saving {tag} dataset to local cache:
{e}"; keep the finally block that awaits http_client.close() so the client is
always closed.
🧹 Nitpick comments (4)
src/aiperf/dataset/loader/base_loader.py (1)

62-95: Consider moving is_valid_sequence up to BaseLoader to eliminate duplication.

This method is duplicated verbatim in BasePublicDatasetLoader (lines 164–198 of base_public_dataset.py). Since BasePublicDatasetLoader is deprecated but still in use, moving this to BaseLoader would let both subclasses inherit it without duplication.

♻️ Suggested refactor

Move is_valid_sequence from BaseFileLoader to BaseLoader, so both BaseFileLoader and BasePublicDatasetLoader inherit it. Then remove the copy from BasePublicDatasetLoader.
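
A sketch of that hoist; the body and the default thresholds are assumed here, since the excerpt does not show them:

    class BaseLoader:
        def is_valid_sequence(
            self,
            prompt_len: int,
            output_len: int,
            min_len: int = 4,           # assumed default
            max_total_len: int = 2048,  # assumed default
        ) -> bool:
            # Reject degenerate pairs and overly long combined sequences.
            if prompt_len < min_len or output_len < min_len:
                return False
            return prompt_len + output_len <= max_total_len

    class BaseFileLoader(BaseLoader):
        ...  # inherits is_valid_sequence; local copy removed

    class BasePublicDatasetLoader(BaseLoader):  # deprecated
        ...  # duplicate copy removed; inherits the shared implementation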

src/aiperf/dataset/dataset_manager.py (1)

272-295: Blocking synchronous I/O called from async context.

loader.load_dataset() (line 289) and loader.convert_to_conversations(dataset) (line 295) are synchronous and perform file I/O and CPU-intensive tokenization respectively. For large datasets like ShareGPT (~90k entries), this blocks the event loop.

I note the existing _load_custom_dataset and _load_synthetic_dataset follow the same blocking pattern, so this isn't a regression — but worth flagging for future improvement (e.g., wrapping in asyncio.to_thread).

As per coding guidelines: "Use async/await for ALL I/O operations - no time.sleep, no blocking calls".
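
A sketch of that future improvement, with the wrapper method name invented for illustration:

    import asyncio

    async def _load_public_dataset(self, loader) -> list:
        # Run the blocking file parse and the CPU-heavy tokenization in
        # worker threads so the event loop stays responsive.
        dataset = await asyncio.to_thread(loader.load_dataset)
        return await asyncio.to_thread(loader.convert_to_conversations, dataset)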

src/aiperf/dataset/loader/sharegpt.py (2)

51-67: Unused filename parameter in can_load.

The filename parameter is declared but never used in the method body. If it's part of a shared interface signature, consider documenting that or prefixing with _. Otherwise, it may confuse callers into thinking filename-based detection is supported.
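
For example, with the signature assumed and the return value purely illustrative:

    @classmethod
    def can_load(cls, _filename: str | None = None) -> bool:
        # The leading underscore documents that this interface-mandated
        # parameter is intentionally unused: ShareGPT is selected explicitly,
        # not detected from filenames.
        return False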


37-49: self.user_config is assigned twice.

self.user_config is set on line 41, then super().__init__() (line 47) chains to BaseLoader.__init__ which also sets self.user_config. The redundant assignment on line 41 can be removed if super().__init__ is called first, or simply documented as intentional (needed before super() for the self.output_tokens_mean access on line 42).
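
A sketch of the first option, with the constructor signature and the attribute name assumed from this review's description:

    def __init__(self, user_config, filename) -> None:
        # Call BaseLoader first so self.user_config is assigned exactly once...
        super().__init__(user_config, filename)
        # ...then derive dependent attributes from it.
        self.output_tokens_mean = self.user_config.output_tokens_mean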

Comment on lines +297 to +320
    async def _download_public_dataset(self, tag: str, url: str, filename: str) -> Path:
        cache_filepath = AIPERF_DATASET_CACHE_DIR / filename

        if cache_filepath.exists():
            self.info(f"Loading {tag} dataset from local cache {cache_filepath}")
            return cache_filepath

        self.info(f"No local dataset cache found, downloading {tag} from {url}")
        http_client = AioHttpClient(timeout=Environment.DATASET.PUBLIC_DATASET_TIMEOUT)
        try:
            record = await http_client.get_request(
                url, headers={"Accept": "application/json"}
            )
            dataset = record.responses[0].text
            cache_filepath.parent.mkdir(parents=True, exist_ok=True)
            cache_filepath.write_text(dataset)
        except Exception as e:
            raise DatasetLoaderError(
                f"Error saving {tag} dataset to local cache: {e}"
            ) from e
        finally:
            await http_client.close()

        return cache_filepath


⚠️ Potential issue | 🟡 Minor

Misleading error message when the HTTP request itself fails.

The try block covers both the HTTP download (lines 307-310) and the file write (lines 311-312). If get_request raises, the error message says "Error saving … to local cache" which is misleading.

♻️ Suggested fix: separate error handling
     async def _download_public_dataset(self, tag: str, url: str, filename: str) -> Path:
         cache_filepath = AIPERF_DATASET_CACHE_DIR / filename
 
         if cache_filepath.exists():
             self.info(f"Loading {tag} dataset from local cache {cache_filepath}")
             return cache_filepath
 
         self.info(f"No local dataset cache found, downloading {tag} from {url}")
         http_client = AioHttpClient(timeout=Environment.DATASET.PUBLIC_DATASET_TIMEOUT)
         try:
             record = await http_client.get_request(
                 url, headers={"Accept": "application/json"}
             )
             dataset = record.responses[0].text
+        except Exception as e:
+            raise DatasetLoaderError(
+                f"Error downloading {tag} dataset from {url}: {e}"
+            ) from e
+        finally:
+            await http_client.close()
+
+        try:
             cache_filepath.parent.mkdir(parents=True, exist_ok=True)
             cache_filepath.write_text(dataset)
         except Exception as e:
             raise DatasetLoaderError(
                 f"Error saving {tag} dataset to local cache: {e}"
             ) from e
-        finally:
-            await http_client.close()
 
         return cache_filepath
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 314-316: Avoid specifying long messages outside the exception class

(TRY003)



codecov bot commented Feb 8, 2026

Codecov Report

❌ Patch coverage is 65.62500% with 22 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/aiperf/dataset/dataset_manager.py | 54.28% | 15 missing, 1 partial ⚠️
src/aiperf/dataset/loader/sharegpt.py | 76.19% | 4 missing, 1 partial ⚠️
src/aiperf/dataset/loader/base_public_dataset.py | 50.00% | 1 missing ⚠️


@ardecode ardecode marked this pull request as draft February 8, 2026 23:23

Labels

feat

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor PublicDatasets to be actual Loaders implementations

1 participant
