Priority Level
High (Major improvement)
Is your feature request related to a problem? Please describe.
When a long-running build() call is interrupted (machine crash, OOM kill, network failure), the user has to restart generation from scratch — even though completed batches are already written to disk.
Worse, when the user reruns the same script, `resolved_dataset_name` in `ArtifactStorage` detects the existing folder and silently creates a new timestamped directory, so previous partial results are orphaned and invisible:
```python
# artifact_storage.py:66-76
@cached_property
def resolved_dataset_name(self) -> str:
    dataset_path = self.artifact_path / self.dataset_name
    if dataset_path.exists() and len(list(dataset_path.iterdir())) > 0:
        new_dataset_name = f"{self.dataset_name}_{datetime.now().strftime('%m-%d-%Y_%H%M%S')}"
        return new_dataset_name  # starts fresh instead of resuming
```
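A minimal, self-contained sketch of the proposed change (the standalone function and its `resume` flag are hypothetical illustrations, not the actual `ArtifactStorage` API): with `resume=True`, the existing non-empty directory is reused instead of minting a timestamped sibling.

```python
from datetime import datetime
from pathlib import Path
import tempfile

def resolve_dataset_name(artifact_path: Path, dataset_name: str, resume: bool = False) -> str:
    """Hypothetical resume-aware variant of resolved_dataset_name."""
    dataset_path = artifact_path / dataset_name
    if dataset_path.exists() and any(dataset_path.iterdir()):
        if resume:
            return dataset_name  # reuse the existing partial run
        # current behavior: orphan the partial results under a new name
        return f"{dataset_name}_{datetime.now().strftime('%m-%d-%Y_%H%M%S')}"
    return dataset_name

# Simulate a previous run that left a non-empty dataset directory behind.
root = Path(tempfile.mkdtemp())
(root / "my_dataset").mkdir()
(root / "my_dataset" / "metadata.json").write_text("{}")

print(resolve_dataset_name(root, "my_dataset", resume=True))   # my_dataset
print(resolve_dataset_name(root, "my_dataset", resume=False))  # my_dataset_<timestamp>
```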
Describe the solution you'd like
Add a `resume: bool = False` parameter to `DatasetBuilder.build()` and the public `DataDesigner` interface.
When `resume=True`:
- Skip the timestamped-folder logic — use the existing dataset directory as-is
- Read `num_completed_batches` from the already-present `metadata.json`
- Skip already-completed batches in the generation loop (`dataset_builder.py:181`):

```python
# current
for batch_idx in range(self.batch_manager.num_batches):
    ...

# with resume
for batch_idx in range(num_completed_batches, self.batch_manager.num_batches):
    ...
```

- Drop any partial batch in `tmp-partial-parquet-files/` at crash time — simpler and safer than trying to merge incomplete data
- Validate that `num_records` and `buffer_size` passed by the user match `metadata.json`, and raise a clear error if not
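The validation-and-skip step above can be sketched as a small helper (the function name, the metadata keys' mapping to user parameters, and the error wording are assumptions for illustration): it checks the user's parameters against the checkpointed `metadata.json` and returns the first batch index that still needs generating.

```python
import json
import tempfile
from pathlib import Path

def load_resume_state(dataset_dir: Path, num_records: int, buffer_size: int) -> int:
    """Hypothetical helper: validate user parameters against metadata.json
    written by the previous run, then return the first batch index to generate."""
    meta = json.loads((dataset_dir / "metadata.json").read_text())
    for key, given in (("target_num_records", num_records), ("buffer_size", buffer_size)):
        if meta[key] != given:
            raise ValueError(
                f"resume=True, but {key}={given} does not match the "
                f"checkpointed value {meta[key]} in metadata.json"
            )
    return meta["num_completed_batches"]

# Simulate a run that crashed after completing 7 of 20 batches.
dataset_dir = Path(tempfile.mkdtemp())
(dataset_dir / "metadata.json").write_text(json.dumps(
    {"num_completed_batches": 7, "target_num_records": 10_000, "buffer_size": 500}
))

start_batch = load_resume_state(dataset_dir, num_records=10_000, buffer_size=500)
print(start_batch)  # 7, so the resumed loop starts at index 7 (the 8th batch)
```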
Expected usage:

```python
dd = DataDesigner(...)
dd.add_column(...)

# First run — interrupted at batch 7 of 20
results = dd.build(num_records=10_000)

# After restart — picks up from batch 8
results = dd.build(num_records=10_000, resume=True)
```
Describe alternatives you've considered
- Manual merging — load already-generated parquet files and concatenate with a fresh run for the missing rows. Works but puts the burden entirely on the user, is error-prone, and requires knowing exactly which rows were completed.
- Automatic detection without a flag — detect the existing directory and resume automatically. Rejected because it removes user intent: a user who wants a clean re-run would be surprised by silent resumption.
Agent Investigation
Agent explored the codebase and confirmed:
- `metadata.json` is written after every completed batch (`dataset_batch_manager.py:89-100`) and already contains `num_completed_batches`, `target_num_records`, `actual_num_records`, and `buffer_size`
- Completed batches are durably stored in `parquet-files/batch_{N:05d}.parquet` before metadata is updated
- An in-progress batch at crash time may be in `tmp-partial-parquet-files/`
- No code path currently reads `num_completed_batches` on startup to skip completed work
- The async engine path (`_build_async`, `dataset_builder.py:256`) would need the same treatment
- Seed generators (`can_generate_from_scratch`) should be audited to confirm they produce deterministic output for the same batch index — required for correctness of a resumed run
- No existing issue or PR covers this feature (searched: resume, checkpoint, restart, interrupt, recover, partial)
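The determinism audit mentioned above can be illustrated with a toy check (the audit function and both generators are hypothetical, not the library's seed generator interface): a generator is resume-safe only if requesting the same batch index twice yields identical rows.

```python
import random

def audit_batch_determinism(make_batch, num_batches: int) -> bool:
    """Hypothetical audit: generating the same batch index twice
    must yield identical rows for a resume to be correct."""
    return all(make_batch(i) == make_batch(i) for i in range(num_batches))

def seeded_batch(batch_idx: int) -> list[int]:
    # Deterministic: the RNG is seeded from the batch index itself.
    rng = random.Random(batch_idx)
    return [rng.randint(0, 100) for _ in range(5)]

def stateful_batch(batch_idx: int) -> list[int]:
    # Non-deterministic across calls: draws from shared global state.
    return [random.randint(0, 100) for _ in range(5)]

print(audit_batch_determinism(seeded_batch, 3))    # True
print(audit_batch_determinism(stateful_batch, 3))  # False (almost surely)
```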
Additional context
This is particularly relevant for:
- Long pipelines with multiple LLM-generated columns where a late-stage timeout (e.g. validation step) causes total loss of earlier columns' work
- Cloud/HPC environments where preemption is common
- Large datasets (tens of thousands of records) where restarting from scratch is costly
Checklist