🤖 feat: add best-of-n support for sub-agents #2916
|
@codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 70f03ce494
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review Please take another look. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 271e0d7fc0
|
@codex review Please take another look. |
|
@codex review Please take another look. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dd60327c27
|
@codex review Please take another look. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f236d7b3dd
f236d7b to 9b8a148
|
@codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 67b5557f5c
|
@codex review Please take another look. |
|
Codex Review: Didn't find any major issues. 🚀 |
|
@codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3624e33604
|
@codex review |
|
@codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 492bf42ee9
|
@codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fb323fb1a1
|
@codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7fc181773f
|
@codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be6b8b3fb5
|
@codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: baebaef3e0
|
@codex review |
Handle partial-spawn best-of counts in the parent task UI and avoid suppressing fallback best-of reports when interrupted grouped recovery can no longer finalize. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$104.87`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=104.87 -->
Only defer grouped best-of fallback while a single pending parent task call is still recoverable, so malformed interrupted partial state still falls back to synthetic parent reports. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$104.87`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=104.87 -->
Require better best-of recovery discrimination in the parent task UI, keep recovered groups stable once matched, and remove the unnecessary manual memoization wrapper from sidebar expansion state. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$104.87`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=104.87 -->
Keep grouped best-of task runs observable in terminal workflows by summarizing grouped running/completed task outputs in the CLI formatter. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$104.87`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=104.87 -->
Only bind executing best-of task cards to child groups after concrete task IDs arrive, and update the UI tests to drive the same task-created event path used in production. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$104.87`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=104.87 -->
Use the parent tool message timestamp as the best-of recovery discriminator when no task-created IDs are available, and seed the UI tests with realistic tool timestamps. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$104.87`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=104.87 -->
Preserve grouped taskIds/tasks when a best-of spawn stops after a single candidate so downstream UIs still retain the 1-of-N batching context. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$104.87`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=104.87 -->
When a grouped best-of recovery later becomes impossible because a sibling interrupts without reporting, proactively deliver deferred sibling reports back to the parent conversation. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$104.87`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=104.87 -->
Keep interrupted best-of task headers honest and stop deferring grouped fallback/cleanup when unrelated pending task calls make grouped partial finalization impossible. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$104.87`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=104.87 -->
Skip best-of siblings whose synthetic fallback reports were already appended when replaying deferred reports after later sibling interruptions. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$104.87`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=104.87 -->
Represent interrupted best-of siblings in partial task results and retry deferred fallback delivery when parent streams end or sibling interruptions make grouped finalization impossible. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$104.87`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=104.87 -->
Finalize ready parent best-of task partials before deferred cleanup, add a regression test for that restart-safe recovery path, and simplify several low-risk best-of UI/backend helpers. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$231.83`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=231.83 -->
Finalize pending best-of parent tool calls during startup recovery, avoid rebinding historical best-of cards to later matching groups, and cover both regressions with focused tests. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$231.83`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=231.83 -->
Serialize parent-scoped deferred best-of fallback/finalization work so concurrent child stream-end handlers cannot append duplicate synthetic subagent reports, and add a regression test covering that race. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$231.83`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=231.83 -->
Use the parent-scoped deferred best-of lock for direct reported-child delivery as well as deferred fallback delivery so concurrent reported/interrupted sibling completion cannot append duplicate synthetic reports. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$231.83`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=231.83 -->
Move sibling cleanup rechecks out of the parent-scoped best-of delivery lock so concurrent child stream-end handlers cannot deadlock on parent and child cleanup locks. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$231.83`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=231.83 -->
Only run deferred best-of parent recovery when a pending best-of task partial actually exists, and add coverage so completed grouped task results do not append duplicate synthetic reports on later parent stream-end rechecks. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$231.83`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=231.83 -->
Resolve the pending parent best-of group before deferred recovery runs so older stale groups under the same parent cannot finalize the current pending task tool call or emit duplicate fallback reports. --- _Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$231.83`_ <!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=231.83 -->
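The parent-scoped serialization described in the commits above can be illustrated with a minimal sketch. This is not the PR's implementation: `KeyedSerialQueue`, `deliverOnce`, and the `delivered` array are hypothetical names, and the sketch assumes that delivery work is keyed by a parent workspace ID and that checking for an existing report and appending a new one are separated by an asynchronous step.

```typescript
// Hypothetical helper (not from the PR): runs async jobs one at a time per key.
class KeyedSerialQueue {
  private readonly tails = new Map<string, Promise<void>>();

  run<T>(key: string, work: () => Promise<T>): Promise<T> {
    // The stored tail never rejects, so chaining on it is always safe.
    const previous = this.tails.get(key) ?? Promise.resolve();
    const result = previous.then(() => work());
    // Swallow the outcome for the tail so one failed job does not block later ones.
    const tail = result.then(
      () => undefined,
      () => undefined
    );
    this.tails.set(key, tail);
    // Remove the entry once this key's queue has drained.
    void tail.then(() => {
      if (this.tails.get(key) === tail) this.tails.delete(key);
    });
    return result;
  }
}

// Hypothetical usage: two stream-end handlers try to deliver the same report.
const queue = new KeyedSerialQueue();
const delivered: string[] = [];

async function deliverOnce(parentId: string, reportId: string): Promise<void> {
  await queue.run(parentId, async () => {
    if (delivered.includes(reportId)) return; // a concurrent handler already appended it
    await Promise.resolve(); // stands in for asynchronous history I/O
    delivered.push(reportId);
  });
}
```

Without the queue, both handlers could pass the `includes` check before either one appends, which is the duplicate-report race the commit messages describe. Keeping unrelated work (such as sibling cleanup rechecks) outside the queued callback is one way to avoid holding the parent-scoped lock while acquiring others.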
2275f14 to 7381b95
|
@codex review Rebased this branch onto |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7381b9538a
|
@codex review Addressed the stale single-group recovery issue, clarified best-of delegation guidance, and removed tautological constant-only assertions from |
|
Codex Review: Didn't find any major issues. Another round soon, please! |
|
@codex review Factored the repetitive best-of scaffolding in |
|
Codex Review: Didn't find any major issues. Swish! |
Summary
Add best-of-n support to sub-agent spawning, coalesce grouped runs in the transcript and sidebar, clarify that parent agents should do only brief setup before delegating user-requested best-of batches, and harden the grouped recovery paths so interrupted, restarted, or historical parent task calls still render and finalize cleanly.
Background
The `task` tool previously spawned a single sub-agent per call, which made best-of exploration awkward and noisy in the UI. Adding first-class batching helped, but the follow-up review cycle exposed several edge cases around grouped task completion, deferred fallback delivery, startup recovery, transcript rebinding after old child workspaces had already been cleaned up, concurrent child stream-end handlers racing to deliver deferred fallback reports, stale older best-of groups being able to satisfy a newer pending parent partial after restart, and the readability cost of repetitive best-of task-service regression scaffolding.
Implementation
- `n` parameter to the `task` tool, defaulting to `1` and validating the allowed `1–20` range
- `n` sibling child tasks when requested, persist shared best-of metadata on those workspaces, and return grouped task metadata/reports from the tool
- `task` tool guidance so user-requested best-of runs keep the parent focused on brief setup and synthesis instead of duplicating the full child analysis in parallel
- `taskService.test.ts` setup/report helpers so the regression coverage stays behavior-oriented without re-embedding the same child stream-end scaffolding in every case
- `task.test.ts`
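As a rough illustration of the `n` handling listed above, the following sketch applies the stated default (`1`) and the stated range (`1–20`). The function and constant names are hypothetical, and the integer check is an assumption; the PR description does not show the actual schema code.

```typescript
// Hypothetical names; only the default (1) and the range (1–20) come from the PR description.
const DEFAULT_BEST_OF_N = 1;
const MAX_BEST_OF_N = 20;

function resolveBestOfCount(n?: number): number {
  // An omitted `n` falls back to single-task behavior.
  if (n === undefined) return DEFAULT_BEST_OF_N;
  // Assumption: `n` must be a whole number of sibling tasks.
  if (!Number.isInteger(n) || n < 1 || n > MAX_BEST_OF_N) {
    throw new RangeError(`n must be an integer between 1 and ${MAX_BEST_OF_N}, got ${n}`);
  }
  return n;
}
```

Rejecting out-of-range values, rather than clamping them, keeps a mistyped request from silently spawning a different number of children than the caller asked for.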
- `make static-check`
- `bun x jest tests/ui/tasks/bestOfProgress.test.ts --runInBand`
- `bun x jest tests/ui/tasks/awaitVisualization.test.ts`
- `bun test src/browser/components/ProjectSidebar/ProjectSidebar.test.tsx --test-name-pattern 'best-of|Best-of|leaf'`
- `bun test src/node/services/tools/task.test.ts`
- `bun test src/node/services/taskService.test.ts --test-name-pattern 'agent_report waits for all best-of reports|partial best-of spawn failure|duplicate synthetic parent reports|best-of'`
- `bun test src/node/services/taskService.test.ts --test-name-pattern 'stale single best-of group|targets the pending best-of group|finalizes ready best-of partials before cleanup rechecks|initialize finalizes ready best-of partials before cleanup rechecks'`
- `bun test src/node/services/taskService.test.ts --test-name-pattern 'best-of|Best-of|cleanup rechecks|concurrent deferred best-of fallback delivery does not duplicate synthetic reports|finalizes ready best-of partials before cleanup rechecks|initialize finalizes ready best-of partials before cleanup rechecks'`
Risks
This touches task tool result shapes, model-facing delegation guidance, restart-safe partial recovery, deferred fallback delivery, startup cleanup ordering, parent-scoped recovery locking, sidebar/chat rendering for child tasks, and the regression harnesses that cover those flows. The highest regression risk is around grouped task completion diverging from single-task behavior or stale persisted groups being rebound to the wrong pending parent call, so the change is covered with targeted schema, tool, task-service, transcript, and sidebar tests in addition to `make static-check`.
_Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$260.21`_