
feat: Refactor ShareGPT from BasePublicDatasetLoader to BaseFileLoader #657

Draft

ardecode wants to merge 1 commit into ai-dynamo:main from ardecode:feat/refactor-datasets-628

Conversation


@ardecode ardecode commented Feb 8, 2026

Summary

Closes #628

  • Refactors ShareGPTLoader from BasePublicDatasetLoader to BaseFileLoader, separating download/caching from parsing so that public datasets reuse the same file-based loader pattern as local datasets.
  • Extracts the download/cache logic into DatasetManager._download_public_dataset() (see the sketch below).
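
A rough sketch of the resulting shape, with the dataclass fields, registry entries, and cache location assumed for illustration (the real definitions live in src/aiperf/dataset/dataset_manager.py):

    from dataclasses import dataclass
    from pathlib import Path

    # Assumed cache location; the PR defines AIPERF_DATASET_CACHE_DIR in dataset_manager.py.
    AIPERF_DATASET_CACHE_DIR = Path.home() / ".cache" / "aiperf" / "datasets"

    @dataclass(frozen=True)
    class PublicDatasetSpec:
        # Fields inferred from the PR summary; the actual spec may differ.
        url: str          # remote source to download from
        filename: str     # name of the cached file under AIPERF_DATASET_CACHE_DIR
        loader_type: str  # which file-based loader parses the cached copy

    # Hypothetical registry keyed by dataset tag.
    PUBLIC_DATASETS: dict[str, PublicDatasetSpec] = {
        "sharegpt": PublicDatasetSpec(
            url="https://example.com/sharegpt.json",  # placeholder URL
            filename="sharegpt.json",
            loader_type="sharegpt",
        ),
    }

Under this shape, DatasetManager resolves the spec for a requested tag, awaits _download_public_dataset(tag, spec.url, spec.filename), and hands the cached path to the same file-based loader machinery used for local datasets.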

Summary by CodeRabbit

  • New Features

    • Implemented automatic dataset caching system for faster loading and improved performance
    • Enhanced ShareGPT dataset loader with local file support
  • Deprecations

    • BasePublicDatasetLoader has been deprecated; please migrate to the new loader implementation (see the sketch after this list)
  • Improvements

    • Added validation for dataset sequences to ensure data quality
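
As referenced above, a minimal sketch of what such a runtime deprecation warning typically looks like; the exact message and base-class chaining are assumed, not taken from the PR:

    import warnings

    class BasePublicDatasetLoader(BaseLoader):
        def __init__(self, *args, **kwargs) -> None:
            # Assumed form of the warning emitted during initialization.
            warnings.warn(
                "BasePublicDatasetLoader is deprecated; subclass BaseFileLoader instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            super().__init__(*args, **kwargs)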


github-actions bot commented Feb 8, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e61e2ce0b984992f914ef964961c8df33834b29b

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e61e2ce0b984992f914ef964961c8df33834b29b

Last updated for commit: e61e2ce

@github-actions github-actions bot added the feat label Feb 8, 2026
feat: Refactor ShareGPT from BasePublicDatasetLoader to BaseFileLoader (ai-dynamo#628)

Separate download/caching from parsing so public datasets reuse the same
file-based loader pattern as local datasets. Download logic moves to
DatasetManager, and ShareGPT becomes a regular custom_dataset_loader plugin.
BasePublicDatasetLoader is deprecated.

Signed-off-by: ardecode <desaiarijit@gmail.com>
@ardecode ardecode force-pushed the feat/refactor-datasets-628 branch from 5004194 to e61e2ce on February 8, 2026 at 23:15

coderabbitai bot commented Feb 8, 2026

Walkthrough

The changes refactor public dataset loaders into concrete loader implementations using a spec-driven approach. The ShareGPT loader migrates from async remote download to synchronous local-file operations. A new caching system built on AioHttpClient handles public dataset downloads. BaseFileLoader gains a sequence validation method. BasePublicDatasetLoader is deprecated in favor of direct loader implementations.

Changes

Dataset Manager Refactoring (src/aiperf/dataset/dataset_manager.py)
Added a PublicDatasetSpec dataclass and a PUBLIC_DATASETS mapping for spec-driven public dataset configuration. Introduced the async _download_public_dataset method for caching datasets locally using AioHttpClient. Added the AIPERF_DATASET_CACHE_DIR constant and integrated dynamic loader instantiation from the spec.

Base Loader Enhancements (src/aiperf/dataset/loader/base_loader.py, src/aiperf/dataset/loader/base_public_dataset.py)
Added an is_valid_sequence method to BaseFileLoader for validating prompt-output sequences with configurable length constraints. Added a runtime deprecation warning to BasePublicDatasetLoader, emitted during initialization.

ShareGPT Loader Migration (src/aiperf/dataset/loader/sharegpt.py)
Migrated the base class from BasePublicDatasetLoader to BaseFileLoader, changing from async remote download to synchronous local-file operations. Added a can_load classmethod, renamed get_recommended_sampling_strategy to get_preferred_sampling_strategy, and updated load_dataset and convert_to_conversations to sync methods with revised signatures (see the sketch following this list).

Plugin Configuration and Tests (src/aiperf/plugin/plugins.yaml, tests/unit/dataset/loader/test_sharegpt.py, tests/unit/dataset/test_dataset_manager.py)
Registered the ShareGPT loader in plugins.yaml, updated the ShareGPT tests to use the filename parameter and call the sync methods, and added mock patches for _download_public_dataset in the dataset manager tests.
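
As referenced in the ShareGPT row above, a minimal sketch of the migrated loader's synchronous shape; the method signatures and the self.filename attribute are assumptions for illustration, not the PR's exact code:

    import json
    from pathlib import Path

    from aiperf.dataset.loader.base_loader import BaseFileLoader

    class ShareGPTLoader(BaseFileLoader):  # previously BasePublicDatasetLoader
        def load_dataset(self) -> list[dict]:
            # Plain synchronous read: DatasetManager has already downloaded
            # and cached the file locally before the loader runs.
            return json.loads(Path(self.filename).read_text())

        def convert_to_conversations(self, dataset: list[dict]) -> list:
            # Sketch only: filter entries with the new is_valid_sequence
            # check, then build conversation objects (conversion omitted).
            ...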

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hoppy news! Public datasets now load the loader way,
Caching files with AioHttp, keeping code at bay,
ShareGPT hops from async clouds to local files so neat,
BaseLoaders gain validation—refactoring complete! 🎉

🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)

Docstring Coverage ⚠️ Warning: Docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4)

Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
Title Check ✅ Passed: The title clearly and concisely describes the main refactoring, migrating ShareGPT from BasePublicDatasetLoader to BaseFileLoader, which is the primary change in this PR.
Linked Issues Check ✅ Passed: The PR fully implements the objectives from issue #628: it refactors ShareGPT into a concrete loader implementation, decouples download/caching (moved to DatasetManager._download_public_dataset) from parsing, and treats download as a facade over loaders.
Out of Scope Changes Check ✅ Passed: All changes relate directly to the refactoring objectives (spec-driven loader management, BaseFileLoader enhancements, BasePublicDatasetLoader deprecation, and supporting test updates). No extraneous changes detected.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/aiperf/dataset/dataset_manager.py`:
- Around line 297-320: In _download_public_dataset, the current try block wraps
both the HTTP call (AioHttpClient.get_request) and file write
(cache_filepath.write_text) so failures during download produce a misleading
"Error saving ..." message; split the operations into two try/excepts: first
await http_client.get_request(...) and catch exceptions to raise a
DatasetLoaderError like "Error downloading {tag} dataset: {e}", then perform
cache_filepath.parent.mkdir(...) and cache_filepath.write_text(...) in a
separate try/except that raises "Error saving {tag} dataset to local cache:
{e}"; keep the finally block that awaits http_client.close() so the client is
always closed.
🧹 Nitpick comments (4)
src/aiperf/dataset/loader/base_loader.py (1)

62-95: Consider moving is_valid_sequence up to BaseLoader to eliminate duplication.

This method is duplicated verbatim in BasePublicDatasetLoader (lines 164–198 of base_public_dataset.py). Since BasePublicDatasetLoader is deprecated but still in use, moving this to BaseLoader would let both subclasses inherit it without duplication.

♻️ Suggested refactor

Move is_valid_sequence from BaseFileLoader to BaseLoader, so both BaseFileLoader and BasePublicDatasetLoader inherit it. Then remove the copy from BasePublicDatasetLoader.
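
A sketch of that hoist; the body and the default thresholds are assumed here, since the excerpt does not show them:

    class BaseLoader:
        def is_valid_sequence(
            self,
            prompt_len: int,
            output_len: int,
            min_len: int = 4,           # assumed default
            max_total_len: int = 2048,  # assumed default
        ) -> bool:
            # Reject degenerate pairs and overly long combined sequences.
            if prompt_len < min_len or output_len < min_len:
                return False
            return prompt_len + output_len <= max_total_len

    class BaseFileLoader(BaseLoader):
        ...  # inherits is_valid_sequence; local copy removed

    class BasePublicDatasetLoader(BaseLoader):  # deprecated
        ...  # duplicate copy removed; inherits the shared implementation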

src/aiperf/dataset/dataset_manager.py (1)

272-295: Blocking synchronous I/O called from async context.

loader.load_dataset() (line 289) and loader.convert_to_conversations(dataset) (line 295) are synchronous and perform file I/O and CPU-intensive tokenization respectively. For large datasets like ShareGPT (~90k entries), this blocks the event loop.

I note the existing _load_custom_dataset and _load_synthetic_dataset follow the same blocking pattern, so this isn't a regression — but worth flagging for future improvement (e.g., wrapping in asyncio.to_thread).

As per coding guidelines: "Use async/await for ALL I/O operations - no time.sleep, no blocking calls".
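
A sketch of that future improvement, with the wrapper method name invented for illustration:

    import asyncio

    async def _load_public_dataset(self, loader) -> list:
        # Run the blocking file parse and the CPU-heavy tokenization in
        # worker threads so the event loop stays responsive.
        dataset = await asyncio.to_thread(loader.load_dataset)
        return await asyncio.to_thread(loader.convert_to_conversations, dataset)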

src/aiperf/dataset/loader/sharegpt.py (2)

51-67: Unused filename parameter in can_load.

The filename parameter is declared but never used in the method body. If it's part of a shared interface signature, consider documenting that or prefixing with _. Otherwise, it may confuse callers into thinking filename-based detection is supported.
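
For example, with the signature assumed and the return value purely illustrative:

    @classmethod
    def can_load(cls, _filename: str | None = None) -> bool:
        # The leading underscore documents that this interface-mandated
        # parameter is intentionally unused: ShareGPT is selected explicitly,
        # not detected from filenames.
        return False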


37-49: self.user_config is assigned twice.

self.user_config is set on line 41, then super().__init__() (line 47) chains to BaseLoader.__init__ which also sets self.user_config. The redundant assignment on line 41 can be removed if super().__init__ is called first, or simply documented as intentional (needed before super() for the self.output_tokens_mean access on line 42).
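
A sketch of the first option, with the constructor signature and the attribute name assumed from this review's description:

    def __init__(self, user_config, filename) -> None:
        # Call BaseLoader first so self.user_config is assigned exactly once...
        super().__init__(user_config, filename)
        # ...then derive dependent attributes from it.
        self.output_tokens_mean = self.user_config.output_tokens_mean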

Comment on lines +297 to +320
    async def _download_public_dataset(self, tag: str, url: str, filename: str) -> Path:
        cache_filepath = AIPERF_DATASET_CACHE_DIR / filename

        if cache_filepath.exists():
            self.info(f"Loading {tag} dataset from local cache {cache_filepath}")
            return cache_filepath

        self.info(f"No local dataset cache found, downloading {tag} from {url}")
        http_client = AioHttpClient(timeout=Environment.DATASET.PUBLIC_DATASET_TIMEOUT)
        try:
            record = await http_client.get_request(
                url, headers={"Accept": "application/json"}
            )
            dataset = record.responses[0].text
            cache_filepath.parent.mkdir(parents=True, exist_ok=True)
            cache_filepath.write_text(dataset)
        except Exception as e:
            raise DatasetLoaderError(
                f"Error saving {tag} dataset to local cache: {e}"
            ) from e
        finally:
            await http_client.close()

        return cache_filepath


⚠️ Potential issue | 🟡 Minor

Misleading error message when the HTTP request itself fails.

The try block covers both the HTTP download (lines 307-310) and the file write (lines 311-312). If get_request raises, the error message says "Error saving … to local cache" which is misleading.

♻️ Suggested fix: separate error handling
     async def _download_public_dataset(self, tag: str, url: str, filename: str) -> Path:
         cache_filepath = AIPERF_DATASET_CACHE_DIR / filename
 
         if cache_filepath.exists():
             self.info(f"Loading {tag} dataset from local cache {cache_filepath}")
             return cache_filepath
 
         self.info(f"No local dataset cache found, downloading {tag} from {url}")
         http_client = AioHttpClient(timeout=Environment.DATASET.PUBLIC_DATASET_TIMEOUT)
         try:
             record = await http_client.get_request(
                 url, headers={"Accept": "application/json"}
             )
             dataset = record.responses[0].text
+        except Exception as e:
+            raise DatasetLoaderError(
+                f"Error downloading {tag} dataset from {url}: {e}"
+            ) from e
+        finally:
+            await http_client.close()
+
+        try:
             cache_filepath.parent.mkdir(parents=True, exist_ok=True)
             cache_filepath.write_text(dataset)
         except Exception as e:
             raise DatasetLoaderError(
                 f"Error saving {tag} dataset to local cache: {e}"
             ) from e
-        finally:
-            await http_client.close()
 
         return cache_filepath
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 314-316: Avoid specifying long messages outside the exception class

(TRY003)



codecov bot commented Feb 8, 2026

Codecov Report

❌ Patch coverage is 65.62500% with 22 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/aiperf/dataset/dataset_manager.py | 54.28% | 15 missing, 1 partial ⚠️
src/aiperf/dataset/loader/sharegpt.py | 76.19% | 4 missing, 1 partial ⚠️
src/aiperf/dataset/loader/base_public_dataset.py | 50.00% | 1 missing ⚠️


@ardecode ardecode marked this pull request as draft February 8, 2026 23:23

Labels

feat

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor PublicDatasets to be actual Loaders implementations

1 participant
