feat: Refactor ShareGPT from BasePublicDatasetLoader to BaseFileLoader #657

ardecode wants to merge 1 commit into ai-dynamo:main from
Conversation
Try out this PR

Quick install: `pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e61e2ce0b984992f914ef964961c8df33834b29b`

Recommended with virtual environment (using uv):

```
uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@e61e2ce0b984992f914ef964961c8df33834b29b
```

Last updated for commit: e61e2ce
(ai-dynamo#628) Separate download/caching from parsing so public datasets reuse the same file-based loader pattern as local datasets. Download logic moves to DatasetManager, and ShareGPT becomes a regular custom_dataset_loader plugin. BasePublicDatasetLoader is deprecated. Signed-off-by: ardecode <desaiarijit@gmail.com>
5004194 to e61e2ce
Walkthrough

The changes refactor public dataset loaders into actual loader implementations using a spec-driven approach. The ShareGPT loader migrates from async remote download to synchronous local-file operations. A new caching system using AioHttpClient handles public dataset downloads. BaseFileLoader gains a sequence validation method. BasePublicDatasetLoader is deprecated in favor of direct loader implementations.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/aiperf/dataset/dataset_manager.py`:
- Around line 297-320: In _download_public_dataset, the current try block wraps
both the HTTP call (AioHttpClient.get_request) and file write
(cache_filepath.write_text) so failures during download produce a misleading
"Error saving ..." message; split the operations into two try/excepts: first
await http_client.get_request(...) and catch exceptions to raise a
DatasetLoaderError like "Error downloading {tag} dataset: {e}", then perform
cache_filepath.parent.mkdir(...) and cache_filepath.write_text(...) in a
separate try/except that raises "Error saving {tag} dataset to local cache:
{e}"; keep the finally block that awaits http_client.close() so the client is
always closed.
🧹 Nitpick comments (4)
src/aiperf/dataset/loader/base_loader.py (1)
62-95: Consider moving `is_valid_sequence` up to `BaseLoader` to eliminate duplication. This method is duplicated verbatim in `BasePublicDatasetLoader` (lines 164–198 of `base_public_dataset.py`). Since `BasePublicDatasetLoader` is deprecated but still in use, moving this to `BaseLoader` would let both subclasses inherit it without duplication.

♻️ Suggested refactor

Move `is_valid_sequence` from `BaseFileLoader` to `BaseLoader`, so both `BaseFileLoader` and `BasePublicDatasetLoader` inherit it. Then remove the copy from `BasePublicDatasetLoader`.
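A rough sketch of the shape this refactor would take; the real `is_valid_sequence` signature and body are not shown in this excerpt, so the ones below are placeholders:

```python
class BaseLoader:
    def is_valid_sequence(self, prompt_len: int, output_len: int) -> bool:
        # Hypothetical placeholder for the validation logic that is currently
        # duplicated in BaseFileLoader and BasePublicDatasetLoader.
        return prompt_len > 0 and output_len > 0


class BaseFileLoader(BaseLoader):
    ...  # inherits is_valid_sequence; local copy removed


class BasePublicDatasetLoader(BaseLoader):
    ...  # deprecated, but keeps working without its duplicated copy
```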
src/aiperf/dataset/dataset_manager.py (1)

272-295: Blocking synchronous I/O called from async context.

`loader.load_dataset()` (line 289) and `loader.convert_to_conversations(dataset)` (line 295) are synchronous and perform file I/O and CPU-intensive tokenization respectively. For large datasets like ShareGPT (~90k entries), this blocks the event loop. I note the existing `_load_custom_dataset` and `_load_synthetic_dataset` follow the same blocking pattern, so this isn't a regression, but it is worth flagging for future improvement (e.g., wrapping in `asyncio.to_thread`). As per coding guidelines: "Use async/await for ALL I/O operations - no time.sleep, no blocking calls".
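A minimal sketch of the `asyncio.to_thread` approach mentioned above, assuming the loader exposes the synchronous `load_dataset()` and `convert_to_conversations()` calls described in this comment (the helper name is illustrative, not part of the PR):

```python
import asyncio


async def load_conversations_off_loop(loader) -> list:
    # Run the blocking file I/O and tokenization in a worker thread so the
    # event loop stays responsive while large datasets (~90k entries) are parsed.
    dataset = await asyncio.to_thread(loader.load_dataset)
    return await asyncio.to_thread(loader.convert_to_conversations, dataset)
```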
src/aiperf/dataset/loader/sharegpt.py (2)

51-67: Unused `filename` parameter in `can_load`.

The `filename` parameter is declared but never used in the method body. If it's part of a shared interface signature, consider documenting that or prefixing it with `_`. Otherwise, it may confuse callers into thinking filename-based detection is supported.
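A small sketch of the underscore option, assuming a `can_load(self, filename)` style signature; the actual signature and detection logic are not shown here, and the return value is illustrative only:

```python
class ShareGPTLoader(BaseFileLoader):
    def can_load(self, _filename: str | None = None) -> bool:
        # Parameter kept for interface compatibility; ShareGPT detection
        # does not inspect the filename, so it is underscored to say so.
        return True  # illustrative; real detection logic may differ
```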
37-49: `self.user_config` is assigned twice.

`self.user_config` is set on line 41, then `super().__init__()` (line 47) chains to `BaseLoader.__init__`, which also sets `self.user_config`. The redundant assignment on line 41 can be removed if `super().__init__()` is called first, or simply documented as intentional (it is needed before `super()` for the `self.output_tokens_mean` access on line 42).
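A rough sketch of the reordering option; the constructor arguments and the exact `output_tokens_mean` lookup path are assumptions, not the PR's actual code:

```python
class ShareGPTLoader(BaseFileLoader):  # base class per this PR
    def __init__(self, filename: str, user_config, **kwargs) -> None:
        # Call super().__init__ first so BaseLoader sets self.user_config exactly
        # once, then read the value that previously needed the early assignment.
        super().__init__(filename=filename, user_config=user_config, **kwargs)
        self.output_tokens_mean = self.user_config.output_tokens_mean  # hypothetical attribute path
```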
```python
async def _download_public_dataset(self, tag: str, url: str, filename: str) -> Path:
    cache_filepath = AIPERF_DATASET_CACHE_DIR / filename

    if cache_filepath.exists():
        self.info(f"Loading {tag} dataset from local cache {cache_filepath}")
        return cache_filepath

    self.info(f"No local dataset cache found, downloading {tag} from {url}")
    http_client = AioHttpClient(timeout=Environment.DATASET.PUBLIC_DATASET_TIMEOUT)
    try:
        record = await http_client.get_request(
            url, headers={"Accept": "application/json"}
        )
        dataset = record.responses[0].text
        cache_filepath.parent.mkdir(parents=True, exist_ok=True)
        cache_filepath.write_text(dataset)
    except Exception as e:
        raise DatasetLoaderError(
            f"Error saving {tag} dataset to local cache: {e}"
        ) from e
    finally:
        await http_client.close()

    return cache_filepath
```
Misleading error message when the HTTP request itself fails.
The try block covers both the HTTP download (lines 307-310) and the file write (lines 311-312). If `get_request` raises, the error message says "Error saving … to local cache", which is misleading.
♻️ Suggested fix: separate error handling
```diff
 async def _download_public_dataset(self, tag: str, url: str, filename: str) -> Path:
     cache_filepath = AIPERF_DATASET_CACHE_DIR / filename
     if cache_filepath.exists():
         self.info(f"Loading {tag} dataset from local cache {cache_filepath}")
         return cache_filepath
     self.info(f"No local dataset cache found, downloading {tag} from {url}")
     http_client = AioHttpClient(timeout=Environment.DATASET.PUBLIC_DATASET_TIMEOUT)
     try:
         record = await http_client.get_request(
             url, headers={"Accept": "application/json"}
         )
         dataset = record.responses[0].text
+    except Exception as e:
+        raise DatasetLoaderError(
+            f"Error downloading {tag} dataset from {url}: {e}"
+        ) from e
+    finally:
+        await http_client.close()
+
+    try:
         cache_filepath.parent.mkdir(parents=True, exist_ok=True)
         cache_filepath.write_text(dataset)
     except Exception as e:
         raise DatasetLoaderError(
             f"Error saving {tag} dataset to local cache: {e}"
         ) from e
-    finally:
-        await http_client.close()
     return cache_filepath
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 314-316: Avoid specifying long messages outside the exception class (TRY003)
🤖 Prompt for AI Agents
In `@src/aiperf/dataset/dataset_manager.py` around lines 297-320: in `_download_public_dataset`, the current try block wraps both the HTTP call
(AioHttpClient.get_request) and file write (cache_filepath.write_text) so
failures during download produce a misleading "Error saving ..." message; split
the operations into two try/excepts: first await http_client.get_request(...)
and catch exceptions to raise a DatasetLoaderError like "Error downloading {tag}
dataset: {e}", then perform cache_filepath.parent.mkdir(...) and
cache_filepath.write_text(...) in a separate try/except that raises "Error
saving {tag} dataset to local cache: {e}"; keep the finally block that awaits
http_client.close() so the client is always closed.
Codecov Report

❌ Patch coverage is |
Summary
Closes #628
- Refactors `ShareGPTLoader` from `BasePublicDatasetLoader` to `BaseFileLoader`, separating download/caching from parsing so public datasets reuse the same file-based loader pattern as local datasets.
- Download/caching is now handled by `DatasetManager._download_public_dataset()`.

Summary by CodeRabbit
New Features
Deprecations
Improvements