async engine: side-effect columns (e.g. extract_reasoning_content) not written to buffer, causing downstream template failures #549

@przemekboruta

Priority Level

High (Major functionality broken)

Describe the bug

With DATA_DESIGNER_ASYNC_ENGINE=1, any column that depends on a side-effect column (e.g. assistant__reasoning_content, produced by an LLMTextColumnConfig with extract_reasoning_content=True) fails with a template rendering error that reports the column as missing, even though the DAG dependency is resolved correctly.

Root cause: AsyncTaskScheduler._run_cell (and _run_batch) filters the buffer write by instance_to_columns, which is built from the generators dict. That dict only maps primary column names (e.g. "assistant") to their generator — it never includes side-effect column names (e.g. "assistant__reasoning_content"). As a result, the side-effect values returned by the generator are silently discarded and never reach the row buffer. When a downstream column (e.g. reasoning_pl) runs, its template fails because the side-effect key is absent from the row dict.
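
The filtering described above can be reproduced in isolation. The following is a minimal sketch, not the scheduler's actual code: ToyGenerator, filtered_write, and the plain-dict row buffer are hypothetical stand-ins, and only the instance_to_columns lookup mirrors the snippet quoted later in this issue.

```python
# Toy model of the filtered buffer write; every name here is an illustrative
# stand-in, not part of the data_designer API.

class ToyGenerator:
    """Stands in for a generator that returns a primary column plus a side effect."""

    def generate(self, row: dict) -> dict:
        return {
            "assistant": "final answer",
            "assistant__reasoning_content": "step-by-step reasoning",
        }


generator = ToyGenerator()
generators = {"assistant": generator}  # keyed by primary column name only

# Built as the issue describes: one entry per generator instance, listing only
# the dict keys that point at that instance.
instance_to_columns: dict[int, list[str]] = {}
for column_name, gen in generators.items():
    instance_to_columns.setdefault(id(gen), []).append(column_name)


def filtered_write(row: dict, gen: ToyGenerator, result: dict) -> None:
    """Copy into the row only the columns registered for this generator."""
    output_cols = instance_to_columns.get(id(gen), [])
    for col in output_cols:
        if col in result:
            row[col] = result[col]


row_buffer = {"user": "hello"}
filtered_write(row_buffer, generator, generator.generate(row_buffer))

# The side-effect value was returned by the generator but never stored.
missing = "assistant__reasoning_content" not in row_buffer
```

In this toy model the primary value lands in the row while the side-effect key is dropped without any error, which matches the silent discard described above.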

The ExecutionGraph correctly resolves side-effect dependencies: reasoning_pl depends on assistant__reasoning_content which is mapped to producer assistant, so the edge assistant → reasoning_pl exists. The completion tracking is also correct — reasoning_pl only becomes ready after assistant finishes. The only broken step is the buffer write.
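
To make the dependency resolution concrete, here is a small sketch of how a side-effect reference can be mapped back to its producer when building edges. It is not the ExecutionGraph implementation; build_edges and the three input structures are hypothetical and exist only to illustrate why the edge assistant → reasoning_pl is expected.

```python
# Illustrative only: a toy edge builder, not the real ExecutionGraph.

# Hypothetical mapping from side-effect column to the primary column producing it.
side_effect_producers = {
    "assistant__reasoning_content": "assistant",
    "assistant__trace": "assistant",
}

# Columns that have their own generator.
primary_columns = {"assistant", "reasoning_pl"}

# Names referenced by each column's prompt template.
template_references = {
    "assistant": ["user"],  # "user" comes from the seed dataset
    "reasoning_pl": ["assistant__reasoning_content"],
}


def build_edges(
    references: dict[str, list[str]],
    producers: dict[str, str],
    primaries: set[str],
) -> set[tuple[str, str]]:
    """Return (producer, consumer) edges, resolving side effects to producers."""
    edges: set[tuple[str, str]] = set()
    for consumer, refs in references.items():
        for ref in refs:
            producer = producers.get(ref, ref)
            # Seed columns have no generator, so they add no edge.
            if producer in primaries and producer != consumer:
                edges.add((producer, consumer))
    return edges


edges = build_edges(template_references, side_effect_producers, primary_columns)
```

Under these assumptions the only edge is ("assistant", "reasoning_pl"), so scheduling order is correct and the failure has to come from a later step.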

Steps/Code to reproduce bug

# Run with DATA_DESIGNER_ASYNC_ENGINE=1
from data_designer.config.column_configs import LLMTextColumnConfig
from data_designer.config.config_builder import DataDesignerConfigBuilder

builder = DataDesignerConfigBuilder(model_configs=[...])
builder.with_seed_dataset(seed_source)  # seed_source: a seed dataset providing the "user" column

builder.add_column(
    LLMTextColumnConfig(
        name="assistant",
        model_alias="my-model",
        prompt="{{ user }}",
        extract_reasoning_content=True,  # produces assistant__reasoning_content
    )
)

builder.add_column(
    LLMTextColumnConfig(
        name="reasoning_pl",
        model_alias="my-model",
        prompt="Translate: {{ assistant__reasoning_content }}",  # ← fails
    )
)

Error observed:

[ERROR] 🛑 There was an error preparing the user prompt template.
The following ['assistant__reasoning_content'] columns are missing!
{
  "column_name": "reasoning_pl",
  "column_type": "llm-text",
  ...
}
[WARNING] Non-retryable failure on reasoning_pl[rg=0, row=5]: ...

Expected behavior

reasoning_pl should receive assistant__reasoning_content in its row context and render the template successfully, producing a translated response for each row.

Agent Diagnostic / Prior Investigation

Traced through AsyncTaskScheduler._run_cell (async_scheduler.py):

# Current (broken) — only writes columns listed in instance_to_columns:
output_cols = self._instance_to_columns.get(id(generator), [task.column])
for col in output_cols:
    if col in result:
        self._buffer_manager.update_cell(task.row_group, task.row_index, col, result[col])

_instance_to_columns is built in __init__ by iterating over the keys of the generators dict, which contain only primary column names. Side-effect columns (returned by generator.agenerate() but not registered as dict keys) are present in result but are never written to the buffer.

Proposed fix: write every key returned by the generator to the buffer instead of filtering by output_cols. The completion-tracking use of output_cols (for mark_cell_complete) is unaffected. The same fix applies to _run_batch.
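
A minimal sketch of the proposed write path follows. It assumes only the update_cell(row_group, row_index, column, value) call shape shown in the snippet above; ToyBufferManager and write_result are hypothetical stand-ins, and the real change would live inside _run_cell and _run_batch.

```python
from collections import defaultdict


class ToyBufferManager:
    """Hypothetical stand-in for the row buffer; stores cells per (row_group, row_index)."""

    def __init__(self) -> None:
        self.cells: dict[tuple[int, int], dict[str, object]] = defaultdict(dict)

    def update_cell(self, row_group: int, row_index: int, column: str, value: object) -> None:
        self.cells[(row_group, row_index)][column] = value


def write_result(buffer: ToyBufferManager, row_group: int, row_index: int, result: dict) -> None:
    """Write every key the generator returned, side-effect columns included."""
    for col, value in result.items():
        buffer.update_cell(row_group, row_index, col, value)


buffer = ToyBufferManager()
result = {
    "assistant": "final answer",
    "assistant__reasoning_content": "step-by-step reasoning",
}
write_result(buffer, 0, 5, result)

# Completion tracking stays keyed on the primary columns only.
output_cols = ["assistant"]
completed = [(0, 5, col) for col in output_cols]
```

One thing to check in the real code: if a generator echoes input columns back in its result, an unfiltered write rewrites those cells as well. Whether that matters depends on how the actual buffer manager handles repeated writes.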

Additional context

  • Affects only DATA_DESIGNER_ASYNC_ENGINE=1. The synchronous engine is not affected.
  • Also affects with_trace side-effect columns ({name}__trace), though extract_reasoning_content is the most common trigger.
  • The same bug exists in _run_batch for full-column generators with side effects.
