feat: resume interrupted dataset generation runs #525

@przemekboruta

Description

Priority Level

High (Major improvement)

Is your feature request related to a problem? Please describe.

When a long-running build() call is interrupted (machine crash, OOM kill, network failure), the user has to restart generation from scratch — even though completed batches are already written to disk.

Worse, when the user reruns the same script, resolved_dataset_name in ArtifactStorage detects the existing folder and silently creates a new timestamped directory, so previous partial results are orphaned and invisible:

# artifact_storage.py:66-76
@cached_property
def resolved_dataset_name(self) -> str:
    dataset_path = self.artifact_path / self.dataset_name
    if dataset_path.exists() and len(list(dataset_path.iterdir())) > 0:
        new_dataset_name = f"{self.dataset_name}_{datetime.now().strftime('%m-%d-%Y_%H%M%S')}"
        return new_dataset_name  # starts fresh instead of resuming
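A resume-aware version of this property could branch on the proposed flag. The sketch below is illustrative, not the project's actual code; the `resume` constructor parameter and the reduced `ArtifactStorage` shape are assumptions made for the example.

```python
# Hypothetical sketch: ArtifactStorage with a resume flag that reuses the
# existing dataset directory instead of creating a timestamped sibling.
from datetime import datetime
from functools import cached_property
from pathlib import Path


class ArtifactStorage:
    def __init__(self, artifact_path: Path, dataset_name: str, resume: bool = False):
        self.artifact_path = artifact_path
        self.dataset_name = dataset_name
        self.resume = resume

    @cached_property
    def resolved_dataset_name(self) -> str:
        dataset_path = self.artifact_path / self.dataset_name
        if dataset_path.exists() and any(dataset_path.iterdir()):
            if self.resume:
                # Reuse the existing directory so completed batches stay visible.
                return self.dataset_name
            # Current behavior: start fresh in a new timestamped directory.
            return f"{self.dataset_name}_{datetime.now().strftime('%m-%d-%Y_%H%M%S')}"
        return self.dataset_name
```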

Describe the solution you'd like

Add a resume: bool = False parameter to DatasetBuilder.build() and the public DataDesigner interface.

When resume=True:

  1. Skip the timestamped-folder logic — use the existing dataset directory as-is
  2. Read num_completed_batches from the already-present metadata.json
  3. Skip already-completed batches in the generation loop (dataset_builder.py:181):
# current
for batch_idx in range(self.batch_manager.num_batches):
    ...

# with resume
for batch_idx in range(num_completed_batches, self.batch_manager.num_batches):
    ...
  4. Drop any partial batch in tmp-partial-parquet-files/ at crash time — simpler and safer than trying to merge incomplete data
  5. Validate that num_records and buffer_size passed by the user match metadata.json, and raise a clear error if not

Expected usage:

dd = DataDesigner(...)
dd.add_column(...)

# First run — interrupted at batch 7 of 20
results = dd.build(num_records=10_000)

# After restart — picks up from batch 8
results = dd.build(num_records=10_000, resume=True)

Describe alternatives you've considered

  • Manual merging — load already-generated parquet files and concatenate with a fresh run for the missing rows. Works but puts the burden entirely on the user, is error-prone, and requires knowing exactly which rows were completed.
  • Automatic detection without a flag — detect the existing directory and resume automatically. Rejected because it removes user intent: a user who wants a clean re-run would be surprised by silent resumption.

Agent Investigation

Agent explored the codebase and confirmed:

  • metadata.json is written after every completed batch (dataset_batch_manager.py:89-100) and already contains num_completed_batches, target_num_records, actual_num_records, buffer_size
  • Completed batches are durably stored in parquet-files/batch_{N:05d}.parquet before metadata is updated
  • An in-progress batch at crash time may be in tmp-partial-parquet-files/
  • No code path currently reads num_completed_batches on startup to skip completed work
  • The async engine path (_build_async, dataset_builder.py:256) would need the same treatment
  • Seed generators (can_generate_from_scratch) should be audited to confirm they produce deterministic output for the same batch index — required for correctness of a resumed run
  • No existing issue or PR covers this feature (searched: resume, checkpoint, restart, interrupt, recover, partial)

Additional context

This is particularly relevant for:

  • Long pipelines with multiple LLM-generated columns where a late-stage timeout (e.g. validation step) causes total loss of earlier columns' work
  • Cloud/HPC environments where preemption is common
  • Large datasets (tens of thousands of records) where restarting from scratch is costly

Checklist

  • I've reviewed existing issues and the documentation
  • This is a design proposal, not a "please build this" request

Metadata

    Labels

    triaged: Issue reviewed and approved by a maintainer
