Priority Level
High (Major functionality broken)
Describe the bug
When using DATA_DESIGNER_ASYNC_ENGINE=1, columns that depend on a side-effect column (e.g. assistant__reasoning_content generated by LLMTextColumnConfig with extract_reasoning_content=True) fail with a template rendering error claiming the column is missing — even though the DAG dependency is correctly resolved.
Root cause: AsyncTaskScheduler._run_cell (and _run_batch) filters the buffer write by instance_to_columns, which is built from the generators dict. That dict only maps primary column names (e.g. "assistant") to their generator — it never includes side-effect column names (e.g. "assistant__reasoning_content"). As a result, the side-effect values returned by the generator are silently discarded and never reach the row buffer. When a downstream column (e.g. reasoning_pl) runs, its template fails because the side-effect key is absent from the row dict.
The ExecutionGraph correctly resolves side-effect dependencies: reasoning_pl depends on assistant__reasoning_content which is mapped to producer assistant, so the edge assistant → reasoning_pl exists. The completion tracking is also correct — reasoning_pl only becomes ready after assistant finishes. The only broken step is the buffer write.
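The dependency resolution described above can be illustrated with a small standalone sketch. This is not the ExecutionGraph implementation: `build_edges`, `side_effect_producers`, and the dict shapes are hypothetical names chosen for illustration. It only shows how mapping a side-effect column to its producer yields the assistant → reasoning_pl edge.

```python
from __future__ import annotations

# Hypothetical illustration only; none of these names come from data_designer.


def build_edges(
    required: dict[str, list[str]],
    side_effect_producers: dict[str, str],
) -> set[tuple[str, str]]:
    """Return (producer, consumer) edges, resolving side-effect columns to their producer."""
    edges: set[tuple[str, str]] = set()
    for consumer, refs in required.items():
        for ref in refs:
            # A side-effect column resolves to the primary column that emits it.
            producer = side_effect_producers.get(ref, ref)
            # Skip references to non-generated columns (e.g. seed data) and self-references.
            if producer in required and producer != consumer:
                edges.add((producer, consumer))
    return edges


required = {
    "assistant": ["user"],  # "user" comes from the seed dataset
    "reasoning_pl": ["assistant__reasoning_content"],
}
side_effect_producers = {"assistant__reasoning_content": "assistant"}

edges = build_edges(required, side_effect_producers)
print(edges)
```

Under these assumptions the only edge is from assistant to reasoning_pl, which matches the scheduling behavior reported above: ordering is right, and only the data hand-off is missing.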
Steps/Code to reproduce bug
# Run with DATA_DESIGNER_ASYNC_ENGINE=1
from data_designer.config.column_configs import LLMTextColumnConfig
from data_designer.config.config_builder import DataDesignerConfigBuilder

builder = DataDesignerConfigBuilder(model_configs=[...])
builder.with_seed_dataset(seed_source)
builder.add_column(
    LLMTextColumnConfig(
        name="assistant",
        model_alias="my-model",
        prompt="{{ user }}",
        extract_reasoning_content=True,  # produces assistant__reasoning_content
    )
)
builder.add_column(
    LLMTextColumnConfig(
        name="reasoning_pl",
        model_alias="my-model",
        prompt="Translate: {{ assistant__reasoning_content }}",  # ← fails
    )
)
Error observed:
[ERROR] 🛑 There was an error preparing the user prompt template.
The following ['assistant__reasoning_content'] columns are missing!
{
"column_name": "reasoning_pl",
"column_type": "llm-text",
...
}
[WARNING] Non-retryable failure on reasoning_pl[rg=0, row=5]: ...
Expected behavior
reasoning_pl should receive assistant__reasoning_content in its row context and render the template successfully, producing a translated response for each row.
Agent Diagnostic / Prior Investigation
Traced through AsyncTaskScheduler._run_cell (async_scheduler.py):
# Current (broken): only writes columns listed in instance_to_columns
output_cols = self._instance_to_columns.get(id(generator), [task.column])
for col in output_cols:
    if col in result:
        self._buffer_manager.update_cell(task.row_group, task.row_index, col, result[col])
instance_to_columns is built in __init__ by iterating over the generators dict keys — which are only primary column names. Side-effect columns (returned by generator.agenerate() but not registered as dict keys) are present in result but never written to the buffer.
Fix: write all keys returned by the generator to the buffer instead of filtering by output_cols. The completion-tracking use of output_cols (for mark_cell_complete) is unaffected. Same fix applies to _run_batch.
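A minimal, self-contained sketch of the proposed change, contrasting the filtered write with a write of every returned key. `FakeBufferManager`, `write_filtered`, and `write_all` are hypothetical stand-ins, not the real scheduler or buffer manager. The only thing assumed about the real code is the `update_cell(row_group, row_index, col, value)` call shape shown in the snippet above.

```python
from __future__ import annotations

from typing import Any


class FakeBufferManager:
    """Hypothetical in-memory stand-in for the row buffer; only mimics update_cell."""

    def __init__(self) -> None:
        self.rows: dict[tuple[int, int], dict[str, Any]] = {}

    def update_cell(self, row_group: int, row_index: int, column: str, value: Any) -> None:
        self.rows.setdefault((row_group, row_index), {})[column] = value


def write_filtered(buffer, row_group, row_index, result, output_cols):
    """Reported behavior: only columns listed in output_cols reach the buffer."""
    for col in output_cols:
        if col in result:
            buffer.update_cell(row_group, row_index, col, result[col])


def write_all(buffer, row_group, row_index, result):
    """Proposed behavior: every key returned by the generator is written."""
    for col, value in result.items():
        buffer.update_cell(row_group, row_index, col, value)


# Shape of a generator result with a side-effect column (values are placeholders).
result = {"assistant": "answer", "assistant__reasoning_content": "reasoning text"}

broken = FakeBufferManager()
write_filtered(broken, 0, 5, result, output_cols=["assistant"])

fixed = FakeBufferManager()
write_all(fixed, 0, 5, result)

print(sorted(broken.rows[(0, 5)]))
print(sorted(fixed.rows[(0, 5)]))
```

With the filtered write the side-effect key never reaches the row, which is what the downstream template then reports as missing. If the real generator returns the whole row rather than only its new keys, writing every key would also rewrite existing values, so that case may be worth checking when applying the fix.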
Additional context
- Affects only DATA_DESIGNER_ASYNC_ENGINE=1. The synchronous engine is not affected.
- Also affects with_trace side-effect columns ({name}__trace), though extract_reasoning_content is the most common trigger.
- The same bug exists in _run_batch for full-column generators with side effects.