feat: resume interrupted dataset generation runs #525

@przemekboruta

Description

Priority Level

High (Major improvement)

Is your feature request related to a problem? Please describe.

When a long-running build() call is interrupted (machine crash, OOM kill, network failure), the user has to restart generation from scratch — even though completed batches are already written to disk.

Worse, when the user reruns the same script, resolved_dataset_name in ArtifactStorage detects the existing folder and silently creates a new timestamped directory, so previous partial results are orphaned and invisible:

# artifact_storage.py:66-76
@cached_property
def resolved_dataset_name(self) -> str:
    dataset_path = self.artifact_path / self.dataset_name
    if dataset_path.exists() and len(list(dataset_path.iterdir())) > 0:
        new_dataset_name = f"{self.dataset_name}_{datetime.now().strftime('%m-%d-%Y_%H%M%S')}"
        return new_dataset_name  # starts fresh instead of resuming
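A resume-aware version of this property could branch on the proposed flag. The sketch below is illustrative, not the project's actual code; the `resume` constructor parameter and the reduced `ArtifactStorage` shape are assumptions made for the example.

```python
# Hypothetical sketch: ArtifactStorage with a resume flag that reuses the
# existing dataset directory instead of creating a timestamped sibling.
from datetime import datetime
from functools import cached_property
from pathlib import Path


class ArtifactStorage:
    def __init__(self, artifact_path: Path, dataset_name: str, resume: bool = False):
        self.artifact_path = artifact_path
        self.dataset_name = dataset_name
        self.resume = resume

    @cached_property
    def resolved_dataset_name(self) -> str:
        dataset_path = self.artifact_path / self.dataset_name
        if dataset_path.exists() and any(dataset_path.iterdir()):
            if self.resume:
                # Reuse the existing directory so completed batches stay visible.
                return self.dataset_name
            # Current behavior: start fresh in a new timestamped directory.
            return f"{self.dataset_name}_{datetime.now().strftime('%m-%d-%Y_%H%M%S')}"
        return self.dataset_name
```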

Describe the solution you'd like

Add a resume: bool = False parameter to DatasetBuilder.build() and the public DataDesigner interface.

When resume=True:

  1. Skip the timestamped-folder logic — use the existing dataset directory as-is
  2. Read num_completed_batches from the already-present metadata.json
  3. Skip already-completed batches in the generation loop (dataset_builder.py:181):
# current
for batch_idx in range(self.batch_manager.num_batches):
    ...

# with resume
for batch_idx in range(num_completed_batches, self.batch_manager.num_batches):
    ...
  4. Drop any partial batch in tmp-partial-parquet-files/ at crash time — simpler and safer than trying to merge incomplete data
  5. Validate that num_records and buffer_size passed by the user match metadata.json, and raise a clear error if not

Expected usage:

dd = DataDesigner(...)
dd.add_column(...)

# First run — interrupted at batch 7 of 20
results = dd.build(num_records=10_000)

# After restart — picks up from batch 8
results = dd.build(num_records=10_000, resume=True)

Describe alternatives you've considered

  • Manual merging — load already-generated parquet files and concatenate with a fresh run for the missing rows. Works but puts the burden entirely on the user, is error-prone, and requires knowing exactly which rows were completed.
  • Automatic detection without a flag — detect the existing directory and resume automatically. Rejected because it removes user intent: a user who wants a clean re-run would be surprised by silent resumption.

Agent Investigation

Agent explored the codebase and confirmed:

  • metadata.json is written after every completed batch (dataset_batch_manager.py:89-100) and already contains num_completed_batches, target_num_records, actual_num_records, buffer_size
  • Completed batches are durably stored in parquet-files/batch_{N:05d}.parquet before metadata is updated
  • An in-progress batch at crash time may be in tmp-partial-parquet-files/
  • No code path currently reads num_completed_batches on startup to skip completed work
  • The async engine path (_build_async, dataset_builder.py:256) would need the same treatment
  • Seed generators (can_generate_from_scratch) should be audited to confirm they produce deterministic output for the same batch index — required for correctness of a resumed run
  • No existing issue or PR covers this feature (searched: resume, checkpoint, restart, interrupt, recover, partial)

Additional context

This is particularly relevant for:

  • Long pipelines with multiple LLM-generated columns where a late-stage timeout (e.g. validation step) causes total loss of earlier columns' work
  • Cloud/HPC environments where preemption is common
  • Large datasets (tens of thousands of records) where restarting from scratch is costly

Checklist

  • I've reviewed existing issues and the documentation
  • This is a design proposal, not a "please build this" request

Metadata

    Labels

    triaged: Issue reviewed and approved by a maintainer
